[jira] [Created] (YARN-9154) Fix itemization in YARN service quickstart document

2018-12-19 Thread Akira Ajisaka (JIRA)
Akira Ajisaka created YARN-9154:
---

 Summary: Fix itemization in YARN service quickstart document
 Key: YARN-9154
 URL: https://issues.apache.org/jira/browse/YARN-9154
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Akira Ajisaka


{noformat:title=QuickStart.md}
Params:
- SERVICE_NAME: The name of the service. Note that this needs to be unique 
across running services for the current user.
- PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
{noformat}
should be
{noformat}
Params:

- SERVICE_NAME: The name of the service. Note that this needs to be unique 
across running services for the current user.
- PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
{noformat}
to render correctly.






[jira] [Comment Edited] (YARN-9154) Fix itemization in YARN service quickstart document

2018-12-19 Thread Akira Ajisaka (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725624#comment-16725624
 ] 

Akira Ajisaka edited comment on YARN-9154 at 12/20/18 6:55 AM:
---

Attached a screenshot: 
https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html#Deploy_a_service
 !Screen Shot 2018-12-20 at 15.54.16.png! 


was (Author: ajisakaa):
Attached a screenshot: 
https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html#Deploy_a_service

> Fix itemization in YARN service quickstart document
> ---
>
> Key: YARN-9154
> URL: https://issues.apache.org/jira/browse/YARN-9154
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Akira Ajisaka
>Priority: Minor
>  Labels: newbie
> Attachments: Screen Shot 2018-12-20 at 15.54.16.png
>
>
> {noformat:title=QuickStart.md}
> Params:
> - SERVICE_NAME: The name of the service. Note that this needs to be unique 
> across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> should be
> {noformat}
> Params:
>
> - SERVICE_NAME: The name of the service. Note that this needs to be unique 
> across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> to render correctly.






[jira] [Updated] (YARN-9154) Fix itemization in YARN service quickstart document

2018-12-19 Thread Akira Ajisaka (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-9154:

Attachment: Screen Shot 2018-12-20 at 15.54.16.png

> Fix itemization in YARN service quickstart document
> ---
>
> Key: YARN-9154
> URL: https://issues.apache.org/jira/browse/YARN-9154
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Akira Ajisaka
>Priority: Minor
>  Labels: newbie
> Attachments: Screen Shot 2018-12-20 at 15.54.16.png
>
>
> {noformat:title=QuickStart.md}
> Params:
> - SERVICE_NAME: The name of the service. Note that this needs to be unique 
> across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> should be
> {noformat}
> Params:
>
> - SERVICE_NAME: The name of the service. Note that this needs to be unique 
> across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> to render correctly.






[jira] [Commented] (YARN-9154) Fix itemization in YARN service quickstart document

2018-12-19 Thread Akira Ajisaka (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725624#comment-16725624
 ] 

Akira Ajisaka commented on YARN-9154:
-

Attached a screenshot: 
https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html#Deploy_a_service

> Fix itemization in YARN service quickstart document
> ---
>
> Key: YARN-9154
> URL: https://issues.apache.org/jira/browse/YARN-9154
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Akira Ajisaka
>Priority: Minor
>  Labels: newbie
> Attachments: Screen Shot 2018-12-20 at 15.54.16.png
>
>
> {noformat:title=QuickStart.md}
> Params:
> - SERVICE_NAME: The name of the service. Note that this needs to be unique 
> across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> should be
> {noformat}
> Params:
>
> - SERVICE_NAME: The name of the service. Note that this needs to be unique 
> across running services for the current user.
> - PATH_TO_SERVICE_DEF: The path to the service definition file in JSON format.
> {noformat}
> to render correctly.






[jira] [Commented] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725606#comment-16725606
 ] 

Hadoop QA commented on YARN-9038:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 6 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
36s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 22s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  7m  
5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
28s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-csi in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
45s{color} | {color:red} hadoop-yarn in the patch failed. {color} |
| {color:red}-1{color} | {color:red} cc {color} | {color:red}  0m 45s{color} | 
{color:red} hadoop-yarn in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 45s{color} 
| {color:red} hadoop-yarn in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 25s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 5 new + 494 unchanged - 0 fixed = 499 total (was 494) {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
30s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-csi in the patch failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  3m 
28s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
22s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
24s{color} | {color:red} hadoop-yarn-csi in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
38s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 30s{color} 
| {color:red} hadoop-yarn-common in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 
17s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 92m 22s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. 

[jira] [Updated] (YARN-5168) Add port mapping handling when docker container use bridge network

2018-12-19 Thread Xun Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xun Liu updated YARN-5168:
--
Attachment: YARN-5168.019.patch

> Add port mapping handling when docker container use bridge network
> --
>
> Key: YARN-5168
> URL: https://issues.apache.org/jira/browse/YARN-5168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Xun Liu
>Priority: Major
>  Labels: Docker
> Attachments: YARN-5168.001.patch, YARN-5168.002.patch, 
> YARN-5168.003.patch, YARN-5168.004.patch, YARN-5168.005.patch, 
> YARN-5168.006.patch, YARN-5168.007.patch, YARN-5168.008.patch, 
> YARN-5168.009.patch, YARN-5168.010.patch, YARN-5168.011.patch, 
> YARN-5168.012.patch, YARN-5168.013.patch, YARN-5168.014.patch, 
> YARN-5168.015.patch, YARN-5168.016.patch, YARN-5168.017.patch, 
> YARN-5168.018.patch, YARN-5168.019.patch, exposedPorts1.png, exposedPorts2.png
>
>
> YARN-4007 addresses different network setups when launching the docker 
> container. We need to support port mapping when the docker container uses a 
> bridge network.
> The following are the problems we faced:
> 1. Add "-P" to map the docker container's exposed ports automatically.
> 2. Add "-p" to let the user specify specific ports to map.
> 3. Add service registry support for the bridge network case, so apps can find 
> each other. It could be done outside of YARN; however, it might be more 
> convenient to support it natively in YARN.
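To illustrate the "-P"/"-p" semantics described above, here is a minimal, hypothetical Java sketch (the class and helper are invented for illustration; this is not the actual YARN-5168 patch) of how such docker run port arguments could be assembled:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not the actual YARN-5168 code.
public class DockerPortArgsSketch {

  /**
   * @param mappings explicit "hostPort:containerPort" pairs; when empty,
   *                 fall back to "-P" so docker publishes all EXPOSEd ports
   *                 to ephemeral host ports automatically.
   */
  static List<String> portArgs(List<String> mappings) {
    List<String> args = new ArrayList<>();
    if (mappings == null || mappings.isEmpty()) {
      args.add("-P");        // publish all exposed ports automatically
    } else {
      for (String mapping : mappings) {
        args.add("-p");      // user-specified host:container mapping
        args.add(mapping);
      }
    }
    return args;
  }

  public static void main(String[] args) {
    // Prints: [-p, 8088:8088, -p, 19888:19888]
    System.out.println(portArgs(List.of("8088:8088", "19888:19888")));
  }
}
{code}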






[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725584#comment-16725584
 ] 

Yuqi Wang commented on YARN-9151:
-

Thanks [~elgoiri], let me try to fix the test and style issues, and add a UT for 
UnknownHostException.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch
>
>
> {color:#205081}*Issue Summary:*{color}
>  The standby RM hangs forever (it neither retries nor crashes) because it is 
> permanently dropped from the leader election.
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines with new/good zk 
> machines, so their DNS hostnames will change.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (here the 
> exception is swallowed and only an event is sent)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (here the 
> exception is swallowed and only an event is sent)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>  
> {noformat}
> The standby RM failed to rejoin the election, but it never retries or crashes 
> afterwards, *so there are no further zk-related logs and the standby RM hangs 
> forever, even if the zk connect string hostnames are changed back to the 
> original ones in DNS.*
>  So this should be a bug in the RM, because *the RM should always try to join 
> the election* (giving up on joining the election should only happen when the RM 
> decides to crash); otherwise, an RM that is not in the election can never 
> become active again and do real work.
>  
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, to be *conservative*, we would be better off *only transitioning to standby 
> for the failures in the {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> failure types added in the future (until we triage them into the whitelist), 
> should crash the RM, because we *cannot ensure* that they will *never* leave 
> the RM unable to work in the standby state, and the *conservative* way is to 
> crash the RM. 
>  Besides, after a crash, the RM's external watchdog service can detect this and 
> try to repair the RM machine, send alerts, etc. 
>  And the RM can reload the latest 
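To illustrate the whitelist idea above, here is a minimal, hypothetical Java sketch (class and method names are invented; this is not the actual YARN-9151 patch): whitelisted RMFatalEvent types transition the RM to standby, and everything else crashes the RM so an external watchdog can repair it or raise alerts.

{code:java}
import java.util.EnumSet;

// Hypothetical sketch of the whitelist idea; not the actual ResourceManager code.
public class RMFatalEventHandlerSketch {

  enum RMFatalEventType {
    STATE_STORE_FENCED,          // Source <- Store (whitelisted)
    STATE_STORE_OP_FAILED,       // Source <- Store (whitelisted)
    EMBEDDED_ELECTOR_FAILED,     // Source <- Embedded Elector
    TRANSITION_TO_ACTIVE_FAILED, // Source <- Admin Service (whitelisted)
    CRITICAL_THREAD_CRASH        // Source <- Critical Thread Crash
  }

  // Only these failure types are treated as safe to handle by going to standby.
  private static final EnumSet<RMFatalEventType> STANDBY_WHITELIST = EnumSet.of(
      RMFatalEventType.STATE_STORE_FENCED,
      RMFatalEventType.STATE_STORE_OP_FAILED,
      RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

  void handle(RMFatalEventType type) {
    if (STANDBY_WHITELIST.contains(type)) {
      transitionToStandby();
    } else {
      // Conservative default: crash, so an external watchdog can repair/alert
      // and the RM restarts with fresh state instead of hanging in standby.
      crash(type);
    }
  }

  private void transitionToStandby() {
    // placeholder for the real transition logic
  }

  private void crash(RMFatalEventType type) {
    throw new IllegalStateException("Fatal RM event, shutting down: " + type);
  }
}
{code}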

[jira] [Commented] (YARN-9129) Ensure flush after printing to log plus additional cleanup

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725510#comment-16725510
 ] 

Hadoop QA commented on YARN-9129:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
16s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  4s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  7m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 29s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 
21s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 
38s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
40s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}125m 43s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9129 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952417/YARN-9129.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  cc  |
| uname | Linux 2b7ce7082449 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e815fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test 

[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network

2018-12-19 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725557#comment-16725557
 ] 

Eric Yang commented on YARN-5168:
-

[~liuxun323] Thank you for the patch.  I think the ContainerReport newInstance 
method does not need to have exposedPorts as a parameter.  This would minimize 
the changes.

> Add port mapping handling when docker container use bridge network
> --
>
> Key: YARN-5168
> URL: https://issues.apache.org/jira/browse/YARN-5168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Xun Liu
>Priority: Major
>  Labels: Docker
> Attachments: YARN-5168.001.patch, YARN-5168.002.patch, 
> YARN-5168.003.patch, YARN-5168.004.patch, YARN-5168.005.patch, 
> YARN-5168.006.patch, YARN-5168.007.patch, YARN-5168.008.patch, 
> YARN-5168.009.patch, YARN-5168.010.patch, YARN-5168.011.patch, 
> YARN-5168.012.patch, YARN-5168.013.patch, YARN-5168.014.patch, 
> YARN-5168.015.patch, YARN-5168.016.patch, YARN-5168.017.patch, 
> YARN-5168.018.patch, exposedPorts1.png, exposedPorts2.png
>
>
> YARN-4007 addresses different network setups when launching the docker 
> container. We need to support port mapping when the docker container uses a 
> bridge network.
> The following are the problems we faced:
> 1. Add "-P" to map the docker container's exposed ports automatically.
> 2. Add "-p" to let the user specify specific ports to map.
> 3. Add service registry support for the bridge network case, so apps can find 
> each other. It could be done outside of YARN; however, it might be more 
> convenient to support it natively in YARN.






[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725560#comment-16725560
 ] 

Hadoop QA commented on YARN-9116:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
35s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 48s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
33s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m  
8s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 12s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 10 new + 327 unchanged - 1 fixed = 337 total (was 328) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 24s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
42s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 90m  1s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
49s{color} | {color:green} hadoop-yarn-submarine in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
40s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}172m 34s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9116 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952434/YARN-9116.1.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 7d0bf6844083 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e815fd9 |
| 

[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy

2018-12-19 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725542#comment-16725542
 ] 

Íñigo Goiri commented on YARN-9130:
---

Thanks [~trjianjianjiao] for the patch and [~surmountian] for the review!
Committed to trunk.

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, 
> Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the Yarn Web Proxy, so that the host name 
> on which the server accepts connections can be overridden.
> It is similar to what has been done in JournalNode and RM, e.g. 
> https://issues.apache.org/jira/browse/HDFS-13462
>  
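For background on what a bind-host setting generally does (a generic sketch, not the YARN-9130 patch; the host name and port below are made up): the advertised host name is what clients use, while the bind host controls which local interface the server actually listens on (e.g. 0.0.0.0 for all interfaces).

{code:java}
import java.net.InetSocketAddress;

// Generic illustration of the bind-host idea; not YARN code.
public class BindHostSketch {
  public static void main(String[] args) {
    String advertisedHost = "proxy.example.com"; // hypothetical client-facing name
    String bindHost = "0.0.0.0";                 // listen on all local interfaces
    int port = 9099;                             // hypothetical port

    // If no bind host is configured, fall back to the advertised host name.
    String listenHost = bindHost.isEmpty() ? advertisedHost : bindHost;
    InetSocketAddress listenAddr = new InetSocketAddress(listenHost, port);

    System.out.println("Clients connect to " + advertisedHost + ":" + port
        + "; server binds to " + listenAddr);
  }
}
{code}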






[jira] [Updated] (YARN-8833) Avoid potential integer overflow when computing fair shares

2018-12-19 Thread liyakun (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyakun updated YARN-8833:
--
Description: 
When using w2rRatio to compute fair shares, there is a chance of triggering an 
int overflow and entering an infinite loop.

Since the compute-shares thread holds the writeLock, this may block the 
scheduling thread.

This issue occurred in a production environment, and we have already fixed it.

 

added 2018-10-29: elaborating on the problem 

/**
 * Compute the resources that would be used given a weight-to-resource ratio
 * w2rRatio, for use in the computeFairShares algorithm as described in #
 */
private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
    Collection<? extends Schedulable> schedulables, String type) {
  int resourcesTaken = 0;
  for (Schedulable sched : schedulables) {
    int share = computeShare(sched, w2rRatio, type);
    resourcesTaken += share;
  }
  return resourcesTaken;
}

The variable resourcesTaken is an int. It accumulates the results of 
computeShare(Schedulable sched, double w2rRatio, String type), each of which is 
a value between the min share and max share of a queue.

For example, when there are 3 queues and each has min share = max share = 
Integer.MAX_VALUE, resourcesTaken will overflow the int range and can become a 
negative number.

When resourceUsedWithWeightToResourceRatio(double w2rRatio, 
Collection<? extends Schedulable> schedulables, String type) returns a negative 
number, the loop in computeSharesInternal(), which holds the scheduler lock, may 
never exit.

 

// org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares
while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
    < totalResource) {
  rMax *= 2.0;
}

This may block the scheduling thread.

  was:
When using w2rRatio to compute fair shares, there is a chance of triggering an 
int overflow and entering an infinite loop.

Since the compute-shares thread holds the writeLock, this may block the 
scheduling thread.

This issue occurred in a production environment with 8500 nodes, and we have 
already fixed it.

 

added 2018-10-29: elaborating on the problem 

/**
 * Compute the resources that would be used given a weight-to-resource ratio
 * w2rRatio, for use in the computeFairShares algorithm as described in #
 */
private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
    Collection<? extends Schedulable> schedulables, String type) {
  int resourcesTaken = 0;
  for (Schedulable sched : schedulables) {
    int share = computeShare(sched, w2rRatio, type);
    resourcesTaken += share;
  }
  return resourcesTaken;
}

The variable resourcesTaken is an int. It accumulates the results of 
computeShare(Schedulable sched, double w2rRatio, String type), each of which is 
a value between the min share and max share of a queue.

For example, when there are 3 queues and each has min share = max share = 
Integer.MAX_VALUE, resourcesTaken will overflow the int range and can become a 
negative number.

When resourceUsedWithWeightToResourceRatio(double w2rRatio, 
Collection<? extends Schedulable> schedulables, String type) returns a negative 
number, the loop in computeSharesInternal(), which holds the scheduler lock, may 
never exit.

 

// org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares
while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
    < totalResource) {
  rMax *= 2.0;
}

This may block the scheduling thread.
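To make the overflow concrete, here is a small, self-contained sketch (illustrative only, not the Hadoop code): with two schedulables whose shares are both Integer.MAX_VALUE, the int accumulator wraps to -2, so a doubling loop of the form shown above never terminates because the wrapped sum stays below any positive totalResource.

{code:java}
// Illustrative only; not the actual ComputeFairShares code.
public class FairShareOverflowDemo {

  // Stand-in for summing computeShare() results across queues; each queue's
  // share here is Integer.MAX_VALUE, so the int accumulator silently wraps.
  static int resourceUsedWithRatio(int numQueues) {
    int resourcesTaken = 0;
    for (int i = 0; i < numQueues; i++) {
      resourcesTaken += Integer.MAX_VALUE;
    }
    return resourcesTaken;
  }

  public static void main(String[] args) {
    System.out.println(resourceUsedWithRatio(2)); // prints -2: the sum wrapped

    // A loop of the ComputeFairShares form then never exits, because the
    // wrapped (negative) sum is always less than a positive totalResource:
    //   while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
    //       < totalResource) { rMax *= 2.0; }
  }
}
{code}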


> Avoid potential integer overflow when computing fair shares
> ---
>
> Key: YARN-8833
> URL: https://issues.apache.org/jira/browse/YARN-8833
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: liyakun
>Assignee: liyakun
>Priority: Major
> Fix For: 3.0.4, 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-8833.1.patch, YARN-8833.2.patch, YARN-8833.3.patch, 
> YARN-8833.patch
>
>
> When using w2rRatio to compute fair shares, there is a chance of triggering an 
> int overflow and entering an infinite loop.
> Since the compute-shares thread holds the writeLock, this may block the 
> scheduling thread.
> This issue occurred in a production environment, and we have already fixed it.
>  
> added 2018-10-29: elaborating on the problem 
> /**
>  * Compute the resources that would be used given a weight-to-resource ratio
>  * w2rRatio, for use in the computeFairShares algorithm as described in #
>  */
> private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
>     Collection<? extends Schedulable> schedulables, String type) {
>   int resourcesTaken = 0;
>   for (Schedulable sched : schedulables) {
>     int share = computeShare(sched, w2rRatio, type);
>     resourcesTaken += share;
>   }
>   return resourcesTaken;
> }
> The variable resourcesTaken is an int. It accumulates the results of 

[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy

2018-12-19 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725536#comment-16725536
 ] 

Íñigo Goiri commented on YARN-9130:
---

[^YARN-9130.003.patch] LGTM.
The approach mimics what is available for other components for bind-host.
+1
Committing to trunk.

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, 
> Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the Yarn Web Proxy, so that the host name 
> on which the server accepts connections can be overridden.
> It is similar to what has been done in JournalNode and RM, e.g. 
> https://issues.apache.org/jira/browse/HDFS-13462
>  






[jira] [Updated] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers

2018-12-19 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9038:
--
Attachment: YARN-9038.004.patch

> [CSI] Add ability to publish/unpublish volumes on node managers
> ---
>
> Key: YARN-9038
> URL: https://issues.apache.org/jira/browse/YARN-9038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: CSI
> Attachments: YARN-9038.001.patch, YARN-9038.002.patch, 
> YARN-9038.003.patch, YARN-9038.004.patch
>
>
> We need to add the ability to publish volumes on node managers in a staging 
> area under the NM's local dir, and then mount the path into the docker 
> container to make it visible inside the container.






[jira] [Commented] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers

2018-12-19 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725519#comment-16725519
 ] 

Weiwei Yang commented on YARN-9038:
---

Oops, v3 patch includes some unexpected changes. Correcting them now..

> [CSI] Add ability to publish/unpublish volumes on node managers
> ---
>
> Key: YARN-9038
> URL: https://issues.apache.org/jira/browse/YARN-9038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: CSI
> Attachments: YARN-9038.001.patch, YARN-9038.002.patch, 
> YARN-9038.003.patch
>
>
> We need to add the ability to publish volumes on node managers in a staging 
> area under the NM's local dir, and then mount the path into the docker 
> container to make it visible inside the container.






[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network

2018-12-19 Thread Xun Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725513#comment-16725513
 ] 

Xun Liu commented on YARN-5168:
---

[~eyang] , I checked the code carefully and I feel that the 3 errors reported 
by Jenkins have nothing to do with my code.
{quote}[https://builds.apache.org/job/PreCommit-YARN-Build/22923/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt]

[https://builds.apache.org/job/PreCommit-YARN-Build/22923/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-applications-distributedshell.txt]

[https://builds.apache.org/job/PreCommit-YARN-Build/22923/artifact/out/patch-unit-hadoop-tools_hadoop-sls.txt]
{quote}
I left the container newInstance with 7 parameters, which are set by the 
setExposedPort function. 

Please help me review the code, thank you!

> Add port mapping handling when docker container use bridge network
> --
>
> Key: YARN-5168
> URL: https://issues.apache.org/jira/browse/YARN-5168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Xun Liu
>Priority: Major
>  Labels: Docker
> Attachments: YARN-5168.001.patch, YARN-5168.002.patch, 
> YARN-5168.003.patch, YARN-5168.004.patch, YARN-5168.005.patch, 
> YARN-5168.006.patch, YARN-5168.007.patch, YARN-5168.008.patch, 
> YARN-5168.009.patch, YARN-5168.010.patch, YARN-5168.011.patch, 
> YARN-5168.012.patch, YARN-5168.013.patch, YARN-5168.014.patch, 
> YARN-5168.015.patch, YARN-5168.016.patch, YARN-5168.017.patch, 
> YARN-5168.018.patch, exposedPorts1.png, exposedPorts2.png
>
>
> YARN-4007 addresses different network setups when launching the docker 
> container. We need to support port mapping when the docker container uses a 
> bridge network.
> The following are the problems we faced:
> 1. Add "-P" to map the docker container's exposed ports automatically.
> 2. Add "-p" to let the user specify specific ports to map.
> 3. Add service registry support for the bridge network case, so apps can find 
> each other. It could be done outside of YARN; however, it might be more 
> convenient to support it natively in YARN.






[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy

2018-12-19 Thread Xiao Liang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725511#comment-16725511
 ] 

Xiao Liang commented on YARN-9130:
--

Thanks [~trjianjianjiao] for the patch. This configuration is necessary in 
certain cases, and [^YARN-9130.003.patch] looks good to me, +1 for it.

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, 
> Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the Yarn Web Proxy, so that the host name 
> on which the server accepts connections can be overridden.
> It is similar to what has been done in JournalNode and RM, e.g. 
> https://issues.apache.org/jira/browse/HDFS-13462
>  






[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725507#comment-16725507
 ] 

Hadoop QA commented on YARN-9130:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
34s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  3s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
18s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 
0 new + 223 unchanged - 1 fixed = 223 total (was 224) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  7s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
46s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
28s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
51s{color} | {color:green} hadoop-yarn-server-web-proxy in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 83m 59s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9130 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952431/YARN-9130.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 981aab09e4f8 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 

[jira] [Resolved] (YARN-8523) Interactive docker shell

2018-12-19 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang resolved YARN-8523.
-
   Resolution: Fixed
Fix Version/s: 3.3.0

Resolved by YARN-8762.

> Interactive docker shell
> 
>
> Key: YARN-8523
> URL: https://issues.apache.org/jira/browse/YARN-8523
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Zian Chen
>Priority: Major
>  Labels: Docker
> Fix For: 3.3.0
>
>
> Some applications might require interactive unix command execution to carry 
> out operations.  Container-executor can interface with docker exec to debug 
> or analyze docker containers while the application is running.  It would be 
> nice to support an API that invokes docker exec to perform unix commands and 
> reports the output back to the application master.  The application master 
> can distribute and aggregate execution of the commands and record the results 
> in the application master log file.






[jira] [Resolved] (YARN-8762) [Umbrella] Support Interactive Docker Shell to running Containers

2018-12-19 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang resolved YARN-8762.
-
   Resolution: Fixed
Fix Version/s: 3.3.0
 Release Note: - Add shell access to YARN containers

All tasks are done. 

Thank you [~Zian Chen] for the contribution.
Thank you [~billie.rinaldi] for the detailed reviews.

> [Umbrella] Support Interactive Docker Shell to running Containers
> -
>
> Key: YARN-8762
> URL: https://issues.apache.org/jira/browse/YARN-8762
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Fix For: 3.3.0
>
> Attachments: Interactive Docker Shell design doc.pdf
>
>
> Debugging distributed applications on Hadoop can be challenging. Hadoop 
> provides limited debugging ability through application log files. One of the 
> most frequently requested features is an interactive shell to assist 
> real-time debugging. This feature is inspired by docker exec and provides the 
> ability to run arbitrary commands in a docker container.






[jira] [Assigned] (YARN-8762) [Umbrella] Support Interactive Docker Shell to running Containers

2018-12-19 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-8762:
---

Assignee: Eric Yang

> [Umbrella] Support Interactive Docker Shell to running Containers
> -
>
> Key: YARN-8762
> URL: https://issues.apache.org/jira/browse/YARN-8762
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Zian Chen
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: Interactive Docker Shell design doc.pdf
>
>
> Debugging distributed applications on Hadoop can be challenging. Hadoop 
> provides limited debugging ability through application log files. One of the 
> most frequently requested features is an interactive shell to assist 
> real-time debugging. This feature is inspired by docker exec and provides the 
> ability to run arbitrary commands in a docker container.






[jira] [Commented] (YARN-9129) Ensure flush after printing to log plus additional cleanup

2018-12-19 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725488#comment-16725488
 ] 

Hudson commented on YARN-9129:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15639 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15639/])
YARN-9129. Ensure flush after printing to log plus additional cleanup. (billie: 
rev 2e544dc921afeaa02e731cb273ac7776eec6e49d)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/ContainerShellWebSocket.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/YarnClientImpl.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c


> Ensure flush after printing to log plus additional cleanup
> --
>
> Key: YARN-9129
> URL: https://issues.apache.org/jira/browse/YARN-9129
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Assignee: Eric Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9129.001.patch, YARN-9129.002.patch, 
> YARN-9129.003.patch
>
>
> Following up on findings in YARN-8962, I noticed the following issues in 
> container-executor and main.c:
> - There seem to be some vars that are not cleaned up in container_executor:
> In run_docker else: free docker_binary
> In exec_container:
>   before return INVALID_COMMAND_FILE: free docker_binary
>   3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead
>   cleanup needed before exit calls?
> - In YARN-8777 we added several fprintf(stderr, ...) calls, but the convention 
> in container-executor.c appears to be fprintf(ERRORFILE, ...) followed by 
> fflush(ERRORFILE).
> - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test.
> - There are additional places where flush is not performed after writing to 
> stderr, including main.c display_feature_disabled_message. This can result in 
> the client not receiving the error message if the connection is closed too 
> quickly.






[jira] [Commented] (YARN-9129) Ensure flush after printing to log plus additional cleanup

2018-12-19 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725469#comment-16725469
 ] 

Billie Rinaldi commented on YARN-9129:
--

+1 for patch 3. Thanks, [~eyang]!

> Ensure flush after printing to log plus additional cleanup
> --
>
> Key: YARN-9129
> URL: https://issues.apache.org/jira/browse/YARN-9129
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Assignee: Eric Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9129.001.patch, YARN-9129.002.patch, 
> YARN-9129.003.patch
>
>
> Following up on findings in YARN-8962, I noticed the following issues in 
> container-executor and main.c:
> - There seem to be some vars that are not cleaned up in container_executor:
> In run_docker else: free docker_binary
> In exec_container:
>   before return INVALID_COMMAND_FILE: free docker_binary
>   3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead
>   cleanup needed before exit calls?
> - In YARN-8777 we added several fprintf(stderr, ...) calls, but the convention 
> in container-executor.c appears to be fprintf(ERRORFILE, ...) followed by 
> fflush(ERRORFILE).
> - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test.
> - There are additional places where flush is not performed after writing to 
> stderr, including main.c display_feature_disabled_message. This can result in 
> the client not receiving the error message if the connection is closed too 
> quickly.






[jira] [Updated] (YARN-9129) Ensure flush after printing to log plus additional cleanup

2018-12-19 Thread Billie Rinaldi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated YARN-9129:
-
Summary: Ensure flush after printing to log plus additional cleanup  (was: 
Ensure flush after printing to stderr plus additional cleanup)

> Ensure flush after printing to log plus additional cleanup
> --
>
> Key: YARN-9129
> URL: https://issues.apache.org/jira/browse/YARN-9129
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Assignee: Eric Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9129.001.patch, YARN-9129.002.patch, 
> YARN-9129.003.patch
>
>
> Following up on findings in YARN-8962, I noticed the following issues in 
> container-executor and main.c:
> - There seem to be some vars that are not cleaned up in container_executor:
> In run_docker else: free docker_binary
> In exec_container:
>   before return INVALID_COMMAND_FILE: free docker_binary
>   3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead
>   cleanup needed before exit calls?
> - In YARN-8777 we added several fprintf(stderr calls, but the convention in 
> container-executor.c appears to be fprintf(ERRORFILE followed by 
> fflush(ERRORFILE).
> - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test.
> - There are additional places where flush is not performed after writing to 
> stderr, including main.c display_feature_disabled_message. This can result in 
> the client not receiving the error message if the connection is closed too 
> quickly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues

2018-12-19 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725466#comment-16725466
 ] 

Aihua Xu edited comment on YARN-9116 at 12/20/18 12:41 AM:
---

Patch-1: this patch adds simple logic that gives queues default memory/vcore 
values when no configuration is set for them. A new configuration, 
"yarn.scheduler.capacity.default-queue-maximum-allocation", is added to set the 
queue-level default for maximum allocation.

I didn't implement queue inheritance since I feel this keeps the configuration 
simpler. Let me know if it's needed and I can do that in a follow-up.
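
To make the fallback order concrete, here is a minimal sketch, assuming the new 
key carries a memory value in MB and using an illustrative helper class; only 
"yarn.scheduler.capacity.default-queue-maximum-allocation" comes from this 
description, the other keys are the standard cluster/per-queue settings and the 
class and method names are assumptions, not the actual patch:

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch only (not the actual patch): resolve a queue's maximum allocation by
// falling back from the per-queue override to the new queue default and
// finally to the cluster-wide maximum.
public final class QueueMaxAllocationResolver {
  public static long resolveMaxAllocationMb(Configuration conf, String queuePath) {
    long clusterMax = conf.getLong("yarn.scheduler.maximum-allocation-mb", 8192);
    // assumed interpretation: the new key supplies the queue-level default (MB)
    long queueDefault = conf.getLong(
        "yarn.scheduler.capacity.default-queue-maximum-allocation", clusterMax);
    long perQueue = conf.getLong(
        "yarn.scheduler.capacity." + queuePath + ".maximum-allocation-mb",
        queueDefault);
    // a queue maximum can never exceed the cluster-wide maximum
    return Math.min(perQueue, clusterMax);
  }
}
{code}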




was (Author: aihuaxu):
Patch-1: in this patch, add the simple logic to give the default memory/vcore 
values to the queues if no configuration is set for such queues. A new 
configuration "yarn.scheduler.capacity.default-queue-maximum-allocation" is 
added to set the queue default for maximum allocation in the configuration. 



> Capacity Scheduler: add the default maximum-allocation-mb and 
> maximum-allocation-vcores for the queues
> --
>
> Key: YARN-9116
> URL: https://issues.apache.org/jira/browse/YARN-9116
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9116.1.patch
>
>
> YARN-1582 adds support for a per-queue maximum-allocation-mb configuration, 
> which targets larger containers on dedicated queues (larger 
> maximum-allocation-mb/maximum-allocation-vcores for such queues).
> To achieve a larger container configuration, we need to increase the global 
> maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and then 
> override those values on the queues, since a queue configuration can't be 
> larger than the cluster configuration. There are many queues in the system, 
> and if we forget to configure such values when adding a new queue, that queue 
> gets the 120G/256 default, which typically is not what we want.
> We can come up with a queue-default configuration (set to a normal queue 
> configuration like 16G/8), so the leaf queues get such values by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues

2018-12-19 Thread Aihua Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated YARN-9116:
---
Attachment: YARN-9116.1.patch

> Capacity Scheduler: add the default maximum-allocation-mb and 
> maximum-allocation-vcores for the queues
> --
>
> Key: YARN-9116
> URL: https://issues.apache.org/jira/browse/YARN-9116
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Affects Versions: 2.7.0
>Reporter: Aihua Xu
>Assignee: Aihua Xu
>Priority: Major
> Attachments: YARN-9116.1.patch
>
>
> YARN-1582 adds support for a per-queue maximum-allocation-mb configuration, 
> which targets larger containers on dedicated queues (larger 
> maximum-allocation-mb/maximum-allocation-vcores for such queues).
> To achieve a larger container configuration, we need to increase the global 
> maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and then 
> override those values on the queues, since a queue configuration can't be 
> larger than the cluster configuration. There are many queues in the system, 
> and if we forget to configure such values when adding a new queue, that queue 
> gets the 120G/256 default, which typically is not what we want.
> We can come up with a queue-default configuration (set to a normal queue 
> configuration like 16G/8), so the leaf queues get such values by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9153) Diagnostics in the container status doesn't get reset after re-init

2018-12-19 Thread Chandni Singh (JIRA)
Chandni Singh created YARN-9153:
---

 Summary: Diagnostics in the container status doesn't get reset 
after re-init 
 Key: YARN-9153
 URL: https://issues.apache.org/jira/browse/YARN-9153
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, yarn
Reporter: Chandni Singh
Assignee: Chandni Singh


When a container is reinitialized, its diagnostics are set to a long string - 
"Reinitializing await...". Even after the container starts running, the 
diagnostics are not cleared. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy

2018-12-19 Thread Rong Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725455#comment-16725455
 ] 

Rong Tang commented on YARN-9130:
-

[~elgoiri]  Fixed the checkstyle.

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, 
> Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the YARN Web Proxy so that the host name 
> on which the server accepts connections can be overridden.
> It is similar to what has been done in JournalNode and RM. See 
> https://issues.apache.org/jira/browse/HDFS-13462
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9130) Add Bind_HOST configuration for Yarn Web Proxy

2018-12-19 Thread Rong Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rong Tang updated YARN-9130:

Attachment: YARN-9130.003.patch

> Add Bind_HOST configuration for Yarn Web Proxy
> --
>
> Key: YARN-9130
> URL: https://issues.apache.org/jira/browse/YARN-9130
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Rong Tang
>Assignee: Rong Tang
>Priority: Major
> Attachments: YARN-9130.002.patch, YARN-9130.003.patch, 
> Yarn-9130.001.patch
>
>
> Allow a configurable bind-host for the YARN Web Proxy so that the host name 
> on which the server accepts connections can be overridden.
> It is similar to what has been done in JournalNode and RM. See 
> https://issues.apache.org/jira/browse/HDFS-13462
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9129) Ensure flush after printing to stderr plus additional cleanup

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725451#comment-16725451
 ] 

Hadoop QA commented on YARN-9129:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
29s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
29s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 52s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
6s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  9m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  9m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 17s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 
21s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 25m 
44s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
39s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}137m 14s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9129 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952417/YARN-9129.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  cc  |
| uname | Linux 6e7470c5bfd8 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e815fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test 

[jira] [Assigned] (YARN-9152) Auxiliary service REST API query does not return running services

2018-12-19 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-9152:
---

Assignee: Billie Rinaldi

> Auxiliary service REST API query does not return running services
> -
>
> Key: YARN-9152
> URL: https://issues.apache.org/jira/browse/YARN-9152
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Major
>
> Auxiliary service is configured with:
> {code}
> {
>   "services": [
> {
>   "name": "mapreduce_shuffle",
>   "version": "2",
>   "configuration": {
> "properties": {
>   "class.name": "org.apache.hadoop.mapred.ShuffleHandler",
>   "mapreduce.shuffle.transfer.buffer.size": "102400",
>   "mapreduce.shuffle.port": "13563"
> }
>   }
> }
>   ]
> }
> {code}
> Node manager log shows the service is registered:
> {code}
> 2018-12-19 22:38:57,466 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: 
> Reading auxiliary services manifest hdfs:/tmp/aux.json
> 2018-12-19 22:38:57,827 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: 
> Initialized auxiliary service mapreduce_shuffle
> 2018-12-19 22:38:57,828 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: 
> Adding auxiliary service mapreduce_shuffle version 2
> {code}
> REST API query shows:
> {code}
> $ curl --negotiate -u :  
> http://eyang-3.openstacklocal:8042/ws/v1/node/auxiliaryservices
> {"services":{}}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9152) Auxiliary service REST API query does not return running services

2018-12-19 Thread Eric Yang (JIRA)
Eric Yang created YARN-9152:
---

 Summary: Auxiliary service REST API query does not return running 
services
 Key: YARN-9152
 URL: https://issues.apache.org/jira/browse/YARN-9152
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Eric Yang


Auxiliary service is configured with:

{code}
{
  "services": [
{
  "name": "mapreduce_shuffle",
  "version": "2",
  "configuration": {
"properties": {
  "class.name": "org.apache.hadoop.mapred.ShuffleHandler",
  "mapreduce.shuffle.transfer.buffer.size": "102400",
  "mapreduce.shuffle.port": "13563"
}
  }
}
  ]
}
{code}

Node manager log shows the service is registered:
{code}
2018-12-19 22:38:57,466 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Reading 
auxiliary services manifest hdfs:/tmp/aux.json
2018-12-19 22:38:57,827 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: 
Initialized auxiliary service mapreduce_shuffle
2018-12-19 22:38:57,828 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding 
auxiliary service mapreduce_shuffle version 2
{code}

REST API query shows:
{code}
$ curl --negotiate -u :  
http://eyang-3.openstacklocal:8042/ws/v1/node/auxiliaryservices
{"services":{}}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9129) Ensure flush after printing to stderr plus additional cleanup

2018-12-19 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-9129:

Attachment: YARN-9129.003.patch

> Ensure flush after printing to stderr plus additional cleanup
> -
>
> Key: YARN-9129
> URL: https://issues.apache.org/jira/browse/YARN-9129
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Assignee: Eric Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9129.001.patch, YARN-9129.002.patch, 
> YARN-9129.003.patch
>
>
> Following up on findings in YARN-8962, I noticed the following issues in 
> container-executor and main.c:
> - There seem to be some vars that are not cleaned up in container_executor:
> In run_docker else: free docker_binary
> In exec_container:
>   before return INVALID_COMMAND_FILE: free docker_binary
>   3x return DOCKER_EXEC_FAILED: set exit code and goto cleanup instead
>   cleanup needed before exit calls?
> - In YARN-8777 we added several fprintf(stderr calls, but the convention in 
> container-executor.c appears to be fprintf(ERRORFILE followed by 
> fflush(ERRORFILE).
> - There are leaks in TestDockerUtil_test_add_ports_mapping_to_command_Test.
> - There are additional places where flush is not performed after writing to 
> stderr, including main.c display_feature_disabled_message. This can result in 
> the client not receiving the error message if the connection is closed too 
> quickly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725325#comment-16725325
 ] 

Hadoop QA commented on YARN-5168:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
16s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 13 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
 9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 16m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  7m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
22m 54s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  9m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  4m 
45s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 14m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
12s{color} | {color:green} root: The patch generated 0 new + 999 unchanged - 7 
fixed = 999 total (was 1006) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  7m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  8s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 10m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  4m 
31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
41s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
26s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
28s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 
13s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
20s{color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the 
patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 

[jira] [Commented] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-19 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725318#comment-16725318
 ] 

Hudson commented on YARN-9126:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15638 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15638/])
YARN-9126.  Fix container clean up for reinitialization. (eyang: 
rev e815fd9c49e80b9200dd8852abe74fe219ad9110)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainersLauncher.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainersLauncher.java


> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9126.001.patch, YARN-9126.002.patch, 
> YARN-9126.003.patch
>
>
> When upgrading a container, container reinitialization always failed with code 
> 33. This error code means the file being localized already exists while copying 
> resource files. The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug 
> from happening. The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725274#comment-16725274
 ] 

Hadoop QA commented on YARN-9126:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  2s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 23s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 3 new + 114 unchanged - 9 fixed = 117 total (was 123) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 29s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 
56s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 73m 26s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9126 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952388/YARN-9126.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 34cc13e7ba8a 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cf57113 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/22926/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22926/testReport/ |
| Max. process+thread count | 340 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Commented] (YARN-9131) Document usage of Dynamic auxiliary services

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725273#comment-16725273
 ] 

Hadoop QA commented on YARN-9131:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
36s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 39m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 24m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 41s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 59s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 51s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 
16s{color} | {color:green} hadoop-mapreduce-client-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
22s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
41s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m 32s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce 

[jira] [Commented] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725234#comment-16725234
 ] 

Hadoop QA commented on YARN-9038:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 6 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
41s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 37s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  7m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  7m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
46s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 35s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 22 new + 494 unchanged - 0 fixed = 516 total (was 494) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  4m 
13s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 28s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
20s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 49s{color} 
| {color:red} hadoop-yarn-api in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
31s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 19m 18s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 35s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 23s{color} 
| {color:red} hadoop-yarn-services-core in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 44s{color} 
| {color:red} hadoop-yarn-csi in the patch failed. {color} |

[jira] [Updated] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-19 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9126:

Attachment: YARN-9126.003.patch

> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Attachments: YARN-9126.001.patch, YARN-9126.002.patch, 
> YARN-9126.003.patch
>
>
> When upgrading a container, container reinitialization always failed with code 
> 33. This error code means the file being localized already exists while copying 
> resource files. The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug 
> from happening. The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9132) Add file permission check for auxiliary services manifest file

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725215#comment-16725215
 ] 

Hadoop QA commented on YARN-9132:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
57s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 
11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 37s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 11s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 51s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 81m 33s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9132 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952255/YARN-9132.2.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux e33551813adf 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cf57113 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/22924/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22924/testReport/ |
| Max. process+thread count | 307 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2018-12-19 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725212#comment-16725212
 ] 

Jim Brennan commented on YARN-9098:
---

[~snemeth] I found the bug (or at least one bug).

In findControllerPathInMountConfig(), you are not looping:
{noformat}
public static String findControllerPathInMountConfig(String controller,
CGroupsMountConfig mountConfig) {
  String path = mountConfig.getPathForController(controller);
  if (path != null) {
if (new File(path).canRead()) {
  return path;
} else {
  LOG.warn(String.format(
  "Skipping inaccessible cgroup mount point %s", path));
}
  }
  return null;
}
{noformat}
If the bad entry for the CPU controller comes before the good entry, then you 
will "skip" and return null for the CPU controller. This code path should loop 
through all of the mountconfig entries so that it will properly skip bad 
entries and find good ones.  It also makes me concerned about other uses of 
mountConfig.getPathForController() - the current code essentially assumes that 
there is only one entry for each controller.
 The reason I think I am hitting it and you (and precommit) are not is because 
this is a hash map, so the ordering is essentially random depending on the file 
paths. Since our file paths are different, we get different orderings, and 
since in my case the bad entry comes first, the tests fail for me.

I found this while debugging the testMtabParsing() test.
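
For comparison, a minimal sketch of the looping behavior described above. It 
assumes a hypothetical CGroupsMountConfig#getPathsForController(String) that 
returns every mount entry recorded for the controller (today the class exposes 
only a single path), so it illustrates the intended fix rather than the actual 
code:
{noformat}
// Sketch only: iterate over all candidate mount points, skip unreadable ones,
// and return the first readable path instead of giving up on the first entry.
public static String findControllerPathInMountConfig(String controller,
    CGroupsMountConfig mountConfig) {
  for (String path : mountConfig.getPathsForController(controller)) {
    if (new File(path).canRead()) {
      return path;  // first readable mount point wins
    }
    LOG.warn(String.format(
        "Skipping inaccessible cgroup mount point %s", path));
  }
  return null;  // no readable mount point found for this controller
}
{noformat}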

> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> --
>
> Key: YARN-9098
> URL: https://issues.apache.org/jira/browse/YARN-9098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9098.002.patch, YARN-9098.003.patch, 
> YARN-9098.004.patch
>
>
> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores 
> cgroups data.
> CGroupsLCEResourcesHandler also has a method with the same name, with 
> identical code.
> The parser code should be extracted from these places and be added in a new 
> class as this is a separate responsibility.
> As the output of the file parser is a Map>, it's better 
> to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance.
> ResourceHandlerModule has a method named parseConfiguredCGroupPath that is 
> responsible for producing the same results (Map>) to 
> store cgroups data; it does not operate on an mtab file, but looks at the 
> filesystem for cgroup settings. As the output is the same, CGroupsMountConfig 
> should be used here, too.
> Again, this code should not be part of ResourceHandlerModule as it is a 
> different responsibility.
> One more thing which is strongly related to the methods above is 
> CGroupsHandlerImpl.initializeFromMountConfig: This method processes the 
> result of a parsed mtab file or a parsed cgroups filesystem data and stores 
> file system paths for all available controllers. This method invokes 
> findControllerPathInMountConfig, which is a duplicated in CGroupsHandlerImpl 
> and CGroupsLCEResourcesHandler, so it should be moved to a single place. To 
> store filesystem path and controller mappings, a new domain object could be 
> introduced.
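
As an illustration of the proposed encapsulation, a minimal sketch of such a 
domain object; the field and method names are illustrative assumptions, not the 
actual patch:

{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: holds controller -> mount path mappings parsed either from an
// mtab file or from the cgroups filesystem hierarchy.
final class CGroupsMountConfig {
  private final Map<String, Set<String>> pathsPerController = new HashMap<>();

  void addMount(String controller, String path) {
    pathsPerController.computeIfAbsent(controller, c -> new HashSet<>()).add(path);
  }

  // all mount points recorded for a controller (possibly empty)
  Set<String> getPathsForController(String controller) {
    return pathsPerController.getOrDefault(controller, Collections.emptySet());
  }
}
{code}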



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725199#comment-16725199
 ] 

Íñigo Goiri commented on YARN-9151:
---

Thanks [~yqwang] for the patch.
I think we want to add a specific test which covers the actual exception 
(i.e., {{UnknownHostException}}) and catches it.
It should be a matter of adding a weird host to the connect string.

Regarding the checkstyle, I'm not very sure how it checks indentation for 
switch/case, but as this is the first place it is used, let's follow the 
recommendation from Yetus and move all the {{case}} labels to the left.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (here the 
> exception is eaten and just an event is sent)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (here the 
> exception is eaten and just an event is sent)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>  
> {noformat}
> The standby RM failed to rejoin the election, but it will never retry or 
> crash later, *so afterwards there are no zk-related logs and the standby RM 
> hangs forever, even if the zk connect string hostnames are changed back to 
> the original ones in DNS.*
>  So, this should be a bug in RM, because *RM should always try to join the 
> election* (giving up on joining the election should only happen when the RM 
> decides to crash); otherwise, an RM that is not in the election can never 
> become active again and start real work.
>  
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> What the JIRA wants to improve is that, when a STATE_STORE_OP_FAILED 
> RMFatalEvent happens, RM should transition to standby instead of crashing.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crashing.* (In contrast, before this change, RM made 
> everything crash instead of going to standby.)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM not working, e.g. staying in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types (until we 

[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Íñigo Goiri (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903
 ] 

Íñigo Goiri edited comment on YARN-9151 at 12/19/18 5:16 PM:
-

BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw the following in the 
log and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: hostxyz
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461)
at 
org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Besides, in YARN-4438, I did not see you use the Curator *Guaranteeable* 
interface.

Could you please confirm the above?

So, in the patch, if rejoin election throws exception, it will send 
EMBEDDED_ELECTOR_FAILED, and then RM will crash.
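
To illustrate the intended behavior, here is a minimal, self-contained sketch 
only; the class and method names below are placeholders, not the actual 
YARN-9151 patch or the real RM/Curator APIs. The key point is that a failed 
re-join must surface a fatal event instead of being swallowed:
{code:java}
// Minimal sketch (hypothetical names): a failed re-join is reported as a
// fatal event so the RM can crash instead of hanging in standby forever.
public class ElectorSketch {
  enum FatalEventType { EMBEDDED_ELECTOR_FAILED }

  interface FatalEventHandler {
    void handle(FatalEventType type, Exception cause);
  }

  private final FatalEventHandler handler;

  public ElectorSketch(FatalEventHandler handler) {
    this.handler = handler;
  }

  void joinElection() throws Exception {
    // In the real service this would (re)create the ZK/Curator leader latch;
    // with a bad connect string it fails with UnknownHostException.
    throw new java.net.UnknownHostException("zk-host-that-no-longer-resolves");
  }

  public void rejoinElection() {
    try {
      joinElection();
    } catch (Exception e) {
      // Do not eat the exception: report it so the RM can decide to exit.
      handler.handle(FatalEventType.EMBEDDED_ELECTOR_FAILED, e);
    }
  }

  public static void main(String[] args) {
    new ElectorSketch((type, cause) -> {
      System.err.println("Fatal: " + type + " caused by " + cause);
      // A conservative RM would exit here instead of staying in standby.
    }).rejoinElection();
  }
}
{code}
With such a handler, a re-join failure caused by an unresolvable zk connect 
string ends in a deliberate crash rather than a silent hang.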


was (Author: yqwang):
BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw below things in the log 
and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Íñigo Goiri (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated YARN-9151:
--
Attachment: (was: yarn_rm.zip)

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>  
> {noformat}
> The standby RM failed to rejoin the election, but it will never retry or 
> crash later, *so afterwards there are no zk related logs and the standby RM 
> hangs forever, even if the zk connect string hostnames are changed back to 
> the original ones in DNS.*
>  So, this should be a bug in RM, because *RM should always try to join the 
> election* (giving up joining the election should only happen when RM decides 
> to crash); otherwise, an RM that is not in the election can never become 
> active again and start real work.
>  
> {color:#205081}*Caused By:*{color}
> It was introduced by YARN-3742.
> What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED 
> RMFatalEvent happens, RM should transition to standby instead of crashing.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM made all 
> of them crash instead of transitioning to standby.)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM unable to work, e.g. stuck in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, to be *conservative*, we had better *only transition to standby for the 
> failures in the {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and any 
> failure types added in the future (until we have triaged them into the 
> whitelist), should crash RM, because we *cannot ensure* that they will 
> *never* leave RM unable to work in standby state, and the *conservative* way 
> is to crash RM.
>  Besides, after the crash, the RM's external watchdog service can detect this 
> and try to repair the RM machine, send alerts, etc.
>  And the RM can reload the latest zk connect string config with the latest 
> hostnames.
> For more details, please 

[jira] [Commented] (YARN-9132) Add file permission check for auxiliary services manifest file

2018-12-19 Thread Billie Rinaldi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725160#comment-16725160
 ] 

Billie Rinaldi commented on YARN-9132:
--

There's a ticket open already about the flaky test failure. I've rerun the 
precommit as well.

> Add file permission check for auxiliary services manifest file
> --
>
> Key: YARN-9132
> URL: https://issues.apache.org/jira/browse/YARN-9132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Major
> Attachments: YARN-9132.1.patch, YARN-9132.2.patch
>
>
> The manifest file in HDFS must be owned by YARN admin or YARN service user 
> only.  This check helps to prevent loading of malware into node manager JVM.
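
For illustration only, a self-contained sketch of that kind of ownership check 
on a local file; the real patch would use the Hadoop FileSystem/FileStatus API 
against HDFS, and the trusted user names below are placeholder assumptions:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

// Sketch: refuse to load an aux-services manifest unless its owner is one of
// the trusted users (here checked on the local filesystem for simplicity).
public class ManifestOwnerCheckSketch {
  private static final Set<String> TRUSTED_OWNERS = Set.of("yarn", "hdfs");

  static void verifyOwner(Path manifest) throws IOException {
    String owner = Files.getOwner(manifest).getName();
    if (!TRUSTED_OWNERS.contains(owner)) {
      throw new IOException("Refusing to load aux service manifest " + manifest
          + ": owned by '" + owner + "', expected one of " + TRUSTED_OWNERS);
    }
  }

  public static void main(String[] args) throws IOException {
    verifyOwner(Paths.get(args.length > 0 ? args[0] : "/tmp/aux-services.json"));
    System.out.println("Manifest owner is trusted; safe to load.");
  }
}
{code}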



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9131) Document usage of Dynamic auxiliary services

2018-12-19 Thread Billie Rinaldi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated YARN-9131:
-
Attachment: YARN-9131.4.patch

> Document usage of Dynamic auxiliary services
> 
>
> Key: YARN-9131
> URL: https://issues.apache.org/jira/browse/YARN-9131
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Major
> Attachments: YARN-9131.1.patch, YARN-9131.2.patch, YARN-9131.3.patch, 
> YARN-9131.4.patch
>
>
> This is a follow up issue to document YARN-9075 for admin to control which 
> aux service to add or remove.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2018-12-19 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725136#comment-16725136
 ] 

Jim Brennan commented on YARN-9098:
---

[~snemeth] thanks for updating the patch.  I will download and retest.

 

> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> --
>
> Key: YARN-9098
> URL: https://issues.apache.org/jira/browse/YARN-9098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9098.002.patch, YARN-9098.003.patch, 
> YARN-9098.004.patch
>
>
> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores 
> cgroups data.
> CGroupsLCEResourcesHandler also has a method with the same name, with 
> identical code.
> The parser code should be extracted from these places and be added in a new 
> class as this is a separate responsibility.
> As the output of the file parser is a Map>, it's better 
> to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance.
> ResourceHandlerModule has a method named parseConfiguredCGroupPath, that is 
> responsible for producing the same results (Map>) to 
> store cgroups data, it does not operate on mtab file, but looking at the 
> filesystem for cgroup settings. As the output is the same, CGroupsMountConfig 
> should be used here, too.
> Again, this code should not be part of ResourceHandlerModule as it is a 
> different responsibility.
> One more thing which is strongly related to the methods above is 
> CGroupsHandlerImpl.initializeFromMountConfig: This method processes the 
> result of a parsed mtab file or a parsed cgroups filesystem data and stores 
> file system paths for all available controllers. This method invokes 
> findControllerPathInMountConfig, which is duplicated in CGroupsHandlerImpl 
> and CGroupsLCEResourcesHandler, so it should be moved to a single place. To 
> store filesystem path and controller mappings, a new domain object could be 
> introduced.
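
As a rough illustration of the proposed domain object (a sketch only, assuming 
the parsed data is a controller-name to mount-paths map; the field and method 
names are assumptions, not the actual patch):
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of a domain object that encapsulates parsed cgroup mount information
// (controller name -> mount paths), whether it came from an mtab file or from
// scanning a configured cgroups filesystem root.
public final class CGroupsMountConfig {
  private final Map<String, List<String>> controllerPaths;

  public CGroupsMountConfig(Map<String, List<String>> controllerPaths) {
    this.controllerPaths = Collections.unmodifiableMap(controllerPaths);
  }

  // Single place for the lookup that is currently duplicated as
  // findControllerPathInMountConfig in two classes.
  public Optional<String> findControllerPath(String controller) {
    List<String> paths = controllerPaths.get(controller);
    return (paths == null || paths.isEmpty())
        ? Optional.empty() : Optional.of(paths.get(0));
  }

  public Map<String, List<String>> asMap() {
    return controllerPaths;
  }
}
{code}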



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Sushil Ks (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-9150:

Description: 
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase; hence creating this JIRA to support 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<configuration>
 <property>
  <name>yarn.timeline-service.schema-creator.class</name>
  <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
 </property>
</configuration>
{code}

     The command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the existing command below can be used irrespective of the backend 
configured.
{code:java}
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create
{code}
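
To illustrate how such a pluggable back-end could be selected at runtime, here 
is a self-contained sketch under assumptions: the interface and class names 
below are placeholders, and only the property name 
yarn.timeline-service.schema-creator.class comes from this description.
{code:java}
import java.util.Properties;

// Sketch of a pluggable schema creator: the back-end class is chosen by a
// configuration property and instantiated reflectively.
public class SchemaCreatorLoaderSketch {

  public interface TimelineSchemaCreator {
    void createTimelineSchema(String[] args) throws Exception;
  }

  public static class HBaseTimelineSchemaCreator implements TimelineSchemaCreator {
    @Override
    public void createTimelineSchema(String[] args) {
      System.out.println("Creating HBase timeline schema, args=" + args.length);
    }
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    // In YARN this property would come from yarn-site.xml.
    conf.setProperty("yarn.timeline-service.schema-creator.class",
        HBaseTimelineSchemaCreator.class.getName());

    String className =
        conf.getProperty("yarn.timeline-service.schema-creator.class");
    TimelineSchemaCreator creator = (TimelineSchemaCreator)
        Class.forName(className).getDeclaredConstructor().newInstance();
    creator.createTimelineSchema(args); // e.g. invoked with "-create"
  }
}
{code}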
 

 

  was:
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schema's only for HBase, Hence creating this JIRA for 
supporting multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}


 yarn.timeline-service.schema-creator.class
 YOUR_TIMELINE_SCHEMA_CREATOR_CLASS 


{code}
**
     The Command needed to run the TimelineSchemaCreator need not be changed 
i.e the below existing command can be used irrespective of the backend 
configured.
{code:java}
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create
{code}
 

 


> Making TimelineSchemaCreator support different backends for Timeline Schema 
> Creation in ATSv2
> -
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-9150.001.patch
>
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for 
> creating Timeline Schema's only for HBase, Hence creating this JIRA for 
> supporting multiple back-ends that ATSv2 can support.
> *Usage:*
>    Add the following property in *yarn-site.xml*
> {code:java}
> 
> 
>  yarn.timeline-service.schema-creator.class
>  YOUR_TIMELINE_SCHEMA_CREATOR_CLASS 
> 
> {code}
>      The Command needed to run the TimelineSchemaCreator need not be changed 
> i.e the below existing command can be used irrespective of the backend 
> configured.
> {code:java}
> bin/hadoop 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
> -create
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Sushil Ks (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-9150:

Summary: Making TimelineSchemaCreator support different backends for 
Timeline Schema Creation in ATSv2  (was: Making TimelineSchemaCreator to 
support different backends for Timeline Schema Creation in ATSv2)

> Making TimelineSchemaCreator support different backends for Timeline Schema 
> Creation in ATSv2
> -
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-9150.001.patch
>
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for 
> creating Timeline Schema's only for HBase, Hence creating this JIRA for 
> supporting multiple back-ends that ATSv2 can support.
> *Usage:*
>    Add the following property in *yarn-site.xml*
> {code:java}
> 
> 
>  yarn.timeline-service.schema-creator.class
>  YOUR_TIMELINE_SCHEMA_CREATOR_CLASS 
> 
> {code}
> **
>      The Command needed to run the TimelineSchemaCreator need not be changed 
> i.e the below existing command can be used irrespective of the backend 
> configured.
> {code:java}
> bin/hadoop 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
> -create
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Sushil Ks (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725084#comment-16725084
 ] 

Sushil Ks commented on YARN-9150:
-

Hi [~rohithsharma] and [~vrushalic], kindly review this patch. I have created 
this JIRA for making *TimelineSchemaCreator* support multiple back-ends, as 
discussed when you reviewed 
[YARN-9016|https://issues.apache.org/jira/browse/YARN-9016].

Not sure if the -1 for *compile* and *javac* tests posted by Jenkins above is 
related to my patch.

> Making TimelineSchemaCreator to support different backends for Timeline 
> Schema Creation in ATSv2
> 
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-9150.001.patch
>
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for 
> creating Timeline Schema's only for HBase, Hence creating this JIRA for 
> supporting multiple back-ends that ATSv2 can support.
> *Usage:*
>    Add the following property in *yarn-site.xml*
> {code:java}
> 
> 
>  yarn.timeline-service.schema-creator.class
>  YOUR_TIMELINE_SCHEMA_CREATOR_CLASS 
> 
> {code}
> **
>      The Command needed to run the TimelineSchemaCreator need not be changed 
> i.e the below existing command can be used irrespective of the backend 
> configured.
> {code:java}
> bin/hadoop 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
> -create
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725080#comment-16725080
 ] 

Hadoop QA commented on YARN-9151:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 38s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
38s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 27s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 9 new + 48 unchanged - 0 fixed = 57 total (was 48) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 17s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 93m 33s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 25m 36s{color} 
| {color:red} hadoop-yarn-client in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
43s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}199m 21s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9151 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952343/YARN-9151.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 3cc045645602 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cf57113 |
| maven | 

[jira] [Updated] (YARN-5168) Add port mapping handling when docker container use bridge network

2018-12-19 Thread Xun Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xun Liu updated YARN-5168:
--
Attachment: YARN-5168.018.patch

> Add port mapping handling when docker container use bridge network
> --
>
> Key: YARN-5168
> URL: https://issues.apache.org/jira/browse/YARN-5168
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jun Gong
>Assignee: Xun Liu
>Priority: Major
>  Labels: Docker
> Attachments: YARN-5168.001.patch, YARN-5168.002.patch, 
> YARN-5168.003.patch, YARN-5168.004.patch, YARN-5168.005.patch, 
> YARN-5168.006.patch, YARN-5168.007.patch, YARN-5168.008.patch, 
> YARN-5168.009.patch, YARN-5168.010.patch, YARN-5168.011.patch, 
> YARN-5168.012.patch, YARN-5168.013.patch, YARN-5168.014.patch, 
> YARN-5168.015.patch, YARN-5168.016.patch, YARN-5168.017.patch, 
> YARN-5168.018.patch, exposedPorts1.png, exposedPorts2.png
>
>
> YARN-4007 addresses different network setups when launching the docker 
> container. We need to support port mapping when the docker container uses a 
> bridge network.
> The following problems are what we faced:
> 1. Add "-P" to map the docker container's exposed ports automatically.
> 2. Add "-p" to let user specify specific ports to map.
> 3. Add service registry support for the bridge network case, so apps can find 
> each other. It could be done outside of YARN; however, it might be more 
> convenient to support it natively in YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9038) [CSI] Add ability to publish/unpublish volumes on node managers

2018-12-19 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9038:
--
Attachment: YARN-9038.003.patch

> [CSI] Add ability to publish/unpublish volumes on node managers
> ---
>
> Key: YARN-9038
> URL: https://issues.apache.org/jira/browse/YARN-9038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: CSI
> Attachments: YARN-9038.001.patch, YARN-9038.002.patch, 
> YARN-9038.003.patch
>
>
> We need to add the ability to publish volumes on node managers in a staging 
> area under the NM's local dir, and then mount the path into the docker 
> container to make it visible inside the container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9033) ResourceHandlerChain#bootstrap is invoked twice during NM start if LinuxContainerExecutor enabled

2018-12-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725010#comment-16725010
 ] 

Zhankun Tang commented on YARN-9033:


[~snemeth], thanks for looking at this. 
{quote}"But actually, the "updateContainer" invocation in YARN-7715 depend on 
containerId's cgroups path creation in "preStart" method which only happens 
when we use "LinuxContainerExecutor"."

Where can I find this code part / what should I check?
{quote}
You can just test with LCE disabled but cGroupsMemoryResourceHandlerImpl 
enabled to check whether YARN-7715 works. Per my testing, it doesn't.

Or note that "updateContainer" in YARN-7715 is actually doing a cgroups 
update. This cgroups update depends on an existing cgroups path. Take 
cGroupsMemoryResourceHandlerImpl for instance:

cGroupsMemoryResourceHandlerImpl#preStart creates the memory cgroups path, and 
cGroupsMemoryResourceHandlerImpl#updateContainer updates the cgroups value in 
this path.

But preStart can only be invoked by LCE using ResourceHandlerChain's preStart, 
so YARN-7715 depends on LCE being enabled. It shouldn't bootstrap the 
ResourceHandlerChain again. Not sure if this makes sense to you.
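
A tiny self-contained sketch of the dependency described above (hypothetical 
names and paths, not the actual resource handler code): updateContainer can 
only write into a per-container cgroups path that preStart has already created.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: updateContainer writes a value into a per-container cgroups path
// that only exists if preStart created it first (in YARN, preStart is driven
// by the LinuxContainerExecutor via the ResourceHandlerChain).
public class CGroupsDependencySketch {
  private final Path cgroupRoot;

  public CGroupsDependencySketch(Path cgroupRoot) {
    this.cgroupRoot = cgroupRoot;
  }

  public void preStart(String containerId) throws IOException {
    Files.createDirectories(cgroupRoot.resolve(containerId));
  }

  public void updateContainer(String containerId, long memoryBytes)
      throws IOException {
    Path limitFile = cgroupRoot.resolve(containerId)
        .resolve("memory.limit_in_bytes");
    if (!Files.isDirectory(limitFile.getParent())) {
      // This is effectively what happens when LCE (and thus preStart) is
      // disabled: there is no cgroups path to update.
      throw new IOException("cgroups path missing; preStart never ran");
    }
    Files.write(limitFile, Long.toString(memoryBytes).getBytes("UTF-8"));
  }

  public static void main(String[] args) throws IOException {
    CGroupsDependencySketch sketch =
        new CGroupsDependencySketch(Paths.get("/tmp/cgroups-sketch/memory"));
    sketch.preStart("container_01"); // comment this out to see the failure
    sketch.updateContainer("container_01", 1024L * 1024 * 1024);
  }
}
{code}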

> ResourceHandlerChain#bootstrap is invoked twice during NM start if 
> LinuxContainerExecutor enabled
> -
>
> Key: YARN-9033
> URL: https://issues.apache.org/jira/browse/YARN-9033
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9033-trunk.001.patch, YARN-9033-trunk.002.patch
>
>
> The ResourceHandlerChain#bootstrap will always be invoked in NM's 
> ContainerScheduler#serviceInit (Involved by YARN-7715)
> So if LCE is enabled, the ResourceHandlerChain#bootstrap will be invoked 
> first and then invoked again in ContainerScheduler#serviceInit.
> But actually, the "updateContainer" invocation in YARN-7715 depend on 
> containerId's cgroups path creation in "preStart" method which only happens 
> when we use "LinuxContainerExecutor". So the bootstrap of 
> ResourceHandlerChain shouldn't happen in ContainerScheduler#serviceInit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724967#comment-16724967
 ] 

Hadoop QA commented on YARN-9098:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 27s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 19s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 4 new + 11 unchanged - 0 fixed = 15 total (was 11) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 39s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 
11s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 70m  0s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9098 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12952342/YARN-9098.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux d61f07cd88cc 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cf57113 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/22920/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22920/testReport/ |
| Max. process+thread count | 424 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
 
{noformat}
The standby RM failed to rejoin the election, but it will never retry or crash 
later, *so afterwards there are no zk related logs and the standby RM hangs 
forever, even if the zk connect string hostnames are changed back to the 
original ones in DNS.*
 So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up joining the election should only happen when RM decides 
to crash); otherwise, an RM that is not in the election can never become 
active again and start real work.

 

{color:#205081}*Caused By:*{color}

It was introduced by YARN-3742.

What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crash.* (In contrast, before this change, RM made all of 
them crash instead of transitioning to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM unable to work, e.g. stuck in standby forever.

And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and any 
failure types added in the future (until we have triaged them into the 
whitelist), should crash RM, because we *cannot ensure* that they will *never* 
leave RM unable to work in standby state, and the *conservative* way is to 
crash RM.
 Besides, after the crash, the RM's external watchdog service can detect this 
and try to repair the RM machine, send alerts, etc.
 And the RM can reload the latest zk connect string config with the latest 
hostnames.

For more details, please check the patch.
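
To make the whitelist idea concrete, here is a minimal self-contained sketch 
(not the actual patch; the event types are copied from the list above, while 
the handler wiring and the "crash" action are simplified placeholders):
{code:java}
import java.util.EnumSet;

// Self-contained sketch of the whitelist policy: only whitelisted fatal
// events transition the RM to standby; everything else is treated as fatal.
public class FatalEventPolicySketch {

  enum RMFatalEventType {
    STATE_STORE_FENCED, STATE_STORE_OP_FAILED, TRANSITION_TO_ACTIVE_FAILED,
    EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH
  }

  // Only these event types are considered safe to handle by transitioning
  // to standby; anything else (including future types) is treated as fatal.
  private static final EnumSet<RMFatalEventType> STANDBY_WHITELIST =
      EnumSet.of(RMFatalEventType.STATE_STORE_FENCED,
                 RMFatalEventType.STATE_STORE_OP_FAILED,
                 RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

  static void handle(RMFatalEventType type) {
    if (STANDBY_WHITELIST.contains(type)) {
      System.out.println(type + ": transition to standby");
    } else {
      System.out.println(type + ": crash RM so the watchdog can repair/restart it");
      // In the real RM this would be an orderly shutdown/exit, not a print.
    }
  }

  public static void main(String[] args) {
    handle(RMFatalEventType.STATE_STORE_OP_FAILED);   // standby
    handle(RMFatalEventType.EMBEDDED_ELECTOR_FAILED); // crash
  }
}
{code}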

  was:
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop 

[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903
 ] 

Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:48 AM:


BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw the following in the 
log and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461)
at 
org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Besides, in YARN-4438, I did not see you use the Curator *Guaranteeable* 
interface.

Could you please confirm the above?

So, in the patch, if rejoin election throws exception, it will send 
EMBEDDED_ELECTOR_FAILED, and then RM will crash.


was (Author: yqwang):
BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw below things in the log 
and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to rejoin the election, but it will never retry or crash 
later, so afterwards no zk related logs and the standby RM is forever hang, 
even if the zk connect string hostnames are changed back the orignal ones in 
DNS.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up joining the election should only happen when RM decides to 
crash); otherwise, an RM that is not in the election can never become active 
again and start real work.

 

{color:#205081}*Caused By:*{color}

It was introduced by YARN-3742.

What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crash.* (In contrast, before this change, RM made all of 
them crash instead of transitioning to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM unable to work, e.g. stuck in standby forever.

And as the author 
[said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and any 
failure types added in the future (until we have triaged them into the 
whitelist), should crash RM, because we *cannot ensure* that they will *never* 
leave RM unable to work in standby state, and the *conservative* way is to 
crash RM.
Besides, after the crash, the RM's external watchdog service can detect this 
and try to repair the RM machine, send alerts, etc.
And the RM can reload the latest zk connect string config with the latest 
hostnames.

For more details, please check the patch.

  was:
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to rejoin the election, but it will never retry or crash 
later, so afterwards no zk related logs and the standby RM is forever hang, 
even if the zk connect string hostnames are changed back the orignal ones in 
DNS.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up joining the election should only happen when RM decides to 
crash); otherwise, an RM that is not in the election can never become active 
again and start real work.

 

{color:#205081}*Caused By:*{color}

It was introduced by YARN-3742.

What that JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crash.* (In contrast, before this change, RM made all of 
them crash instead of transitioning to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM unable to work, e.g. stuck in standby forever.

And as the author 
[said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and any 
failure types added in the future (until we have triaged them into the 
whitelist), should crash RM, because we *cannot ensure* that they will *never* 
leave RM unable to work in standby state, and the *conservative* way is to 
crash RM. Besides, after the crash, the RM's external watchdog service can 
detect this and try to repair the RM machine, send alerts, etc. And the RM can 
reload the latest zk connect string config with the latest hostnames.

For more details, please check the patch.

  was:
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Attachment: (was: YARN-9151.001.patch)

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to rejoin the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang, even if the zk connect string hostnames are changed back the orignal 
> ones in DNS.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author 
> [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will *never* cause RM cannot work in standby state, and the 
> *conservative* way is to crash RM. Besides, after crash, the RM's external 
> watchdog service can know this and try to repair the RM machine, send 

[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724913#comment-16724913
 ] 

Yuqi Wang commented on YARN-9151:
-

[~elgoiri], could you please also check it.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to rejoin the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang, even if the zk connect string hostnames are changed back the orignal 
> ones in DNS.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author 
> [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will *never* cause RM cannot work in standby state, and the 
> *conservative* way is to crash RM. Besides, after crash, the RM's external 
> watchdog service can know this 

[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724913#comment-16724913
 ] 

Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:37 AM:


[~elgoiri], could you please also check it. :)


was (Author: yqwang):
[~elgoiri], could you please also check it.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to rejoin the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang, even if the zk connect string hostnames are changed back the orignal 
> ones in DNS.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author 
> [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will *never* cause RM cannot work in standby 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to rejoin the election, but it will never retry or crash 
later, so afterwards no zk related logs and the standby RM is forever hang, 
even if the zk connect string hostnames are changed back the orignal ones in 
DNS.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join election* 
(give up join election should only happen on RM decide to crash), otherwise, a 
RM without inside the election can never become active again and start real 
works.

 

{color:#205081}*Caused By:*{color}

It is introduced by YARN-3742

The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
happens, RM should transition to standby, instead of crash.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crash.* (In contrast, before this change, RM makes all to 
crash instead of to standby)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM continue not work, such as stay in standby forever.

And as the author 
[said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author is *too optimistic when implement the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, for *conservative*, we would better *only transition to standby for the 
failures in {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and future 
added failure types, should crash RM, because we *cannot ensure* that they will 
*never* cause RM cannot work in standby state, and the *conservative* way is to 
crash RM. Besides, after crash, the RM's external watchdog service can know 
this and try to repair the RM machine, send alerts, etc. And the RM can reload 
the latest zk connect string config with the latest hostnames.

For more details, please check the patch.

  was:
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start 

[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903
 ] 

Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:28 AM:


BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw the following in the log 
and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461)
at 
org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Besides, in YARN-4438, I did not see you use the Curator *Guaranteeable* 
interface.

So, in the patch, if rejoining the election throws an exception, it will send 
EMBEDDED_ELECTOR_FAILED, and then RM will crash and reload the latest zk 
connect string config.
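
To illustrate that last point, here is a rough sketch of the intended rejoin 
behavior (hypothetical interfaces and method names; this only illustrates the 
idea, it is not the actual YARN-9151 patch):
{code:java}
// Rough sketch only (hypothetical names): instead of swallowing the exception
// thrown while rejoining the election, escalate it as a fatal event so that
// the RM exits and reloads fresh zk connect string config on restart.
public class ReJoinElectionSketch {

  interface Elector {
    void joinElection() throws Exception;  // e.g. re-create the leader latch
  }

  interface FatalEventSink {
    void embeddedElectorFailed(String reason);  // stand-in for dispatching EMBEDDED_ELECTOR_FAILED
  }

  void rejoinElection(Elector elector, FatalEventSink sink) {
    try {
      elector.joinElection();
    } catch (Exception e) {
      // Old behavior: log and return, leaving the standby RM outside the
      // election forever. Proposed behavior: report a fatal event; since
      // EMBEDDED_ELECTOR_FAILED is not in the whitelist, the RM will crash.
      sink.embeddedElectorFailed("Failed to rejoin election: " + e);
    }
  }
}
{code}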


was (Author: yqwang):
BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw below things in the log 
and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
 

[jira] [Comment Edited] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903
 ] 

Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:28 AM:


BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw below things in the log 
and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at 
org.apache.zookeeper.client.StaticHostProvider.(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.(ZooKeeper.java:461)
at 
org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Besides, in YARN-4438, I did not see you used the Curator *Guaranteeable* 
interface.

So, in the patch, if rejoin election throws exception, it will send 
EMBEDDED_ELECTOR_FAILED, and then RM will crash.


was (Author: yqwang):
BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw below things in the log 
and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at 

[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724903#comment-16724903
 ] 

Yuqi Wang commented on YARN-9151:
-

BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM 
becomes active? If it is on initAndStartLeaderLatch(), *this RM will never 
become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ?
 The underlying curator implementation *will retry the connection in 
background*, even though the exception is thrown. See *Guaranteeable* interface 
in Curator. I think exit RM is too harsh here. Even though RM remains at 
standby, all services should be already shutdown, so there's no harm to the end 
users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think 
curator will *NOT* retry the connection, because I saw below things in the log 
and checked curator's code:

*Background exception was not retry-able or retry gave up for 
UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception 
was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at 
org.apache.zookeeper.client.StaticHostProvider.(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.(ZooKeeper.java:461)
at 
org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Besides, in YARN-4438, I did not see you used the *Guaranteeable* interface in 
Curator.

So, in the patch, if rejoin election throws exception, it will send 
EMBEDDED_ELECTOR_FAILED, and then RM will crash and reload the latest zk 
connect string config.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start 

[jira] [Comment Edited] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2018-12-19 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724900#comment-16724900
 ] 

Szilard Nemeth edited comment on YARN-9098 at 12/19/18 11:26 AM:
-

Hi [~Jim_Brennan]!

 

Thanks for looking at my patch and taking steps on testing!

First of all:  the static import is fixed with the new patch.

Actually, I understand your fear regarding this refactoring but this eliminates 
some code duplication and improves the quality of the code significantly. 

 

About the test errors: 
 These are very strange errors. At this level, they all seem to be test 
framework issues, as somehow on your computer the temp directory storing 
cgroups and the cpu controller underneath have not been created, which is very 
strange, as the tests have assertions on the file and directory creation.

I re-built and ran the testcases with these commands again on my computer, 
locally: 
 1.
{code:java}
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true{code}
2.
{code:java}
mvn test -pl org.apache.hadoop:hadoop-yarn-server-nodemanager -fae | tee 
~/maventest`date +%Y%m%d`
{code}
 

I don't have test failures on any of the classes where you had, this is the 
excerpt of the output:

 
{noformat}
[INFO] Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.138 s 
- in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser
...

[INFO] Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.538 s 
- in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl
 
...
[INFO] Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.127 s 
- in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths
{noformat}
 

 

This is the file listing of my target directory: 
{noformat}
┌-( szilardnemeth@snemeth-MBP[10:17:19] <0> @YARN-9098 )--( 
~/development/apache/hadoop )--
└-$ cat 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/025802ba-4abe-4862-942b-beef2d279ca7
none 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cp
 cgroup rw,relatime,cpu 0 0
none 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cpu
 cgroup rw,relatime,cpu 0 0
none 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/blkio
 cgroup rw,relatime,blkio 0 0{noformat}
I even tried to remove the test directory (with {{rm -rf 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir}})
 and re-execute the tests but they work fine.
 Anyway, I added some code that better describes the assertion errors plus 
added some logging of cgroup paths to 2 testcases.

 

Could you please rerun the tests with the new code and do a file listing like I 
did? 
 This must be some platform issue I suppose.

The test failures for {{TestCGroupsHandlerImpl}}: 
 {{TestCGroupsHandlerImpl#createPremountedCgroups}} calls:
{noformat}
File cpuCgroup = new File(parentDir, "cpu");
//and later on...
assertTrue("Directory should be created", cpuCgroup.mkdirs());
{noformat}
 

This should create cgroups for cpu, and as you can see, it is even asserted 
properly.

Could you please re-test?

Thanks!

 

 


was (Author: snemeth):
Hi [~Jim_Brennan]!

 

Thanks for looking at my patch and taking steps on testing!

First of all:  the static import is fixed with the new patch.

Actually, I understand your fear regarding this refactoring but this eliminates 
some code duplication and improves the quality of the code significantly. 

 

About the test errors: 
These are very strange errors. At this level, they seem to be all test 
framework issues, as somehow on your computer, the temp directory storing 
cgroups and the cpu controller underneath have not created, which is very 
strange as the tests are having assertions on the file and directory creations.


I re-built and ran the testcases with these commands again on my computer, 
locally: 
1.
{code:java}
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true{code}

2.
{code:java}
mvn test -pl 

[jira] [Updated] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2018-12-19 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9098:
-
Attachment: YARN-9098.004.patch

> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> --
>
> Key: YARN-9098
> URL: https://issues.apache.org/jira/browse/YARN-9098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9098.002.patch, YARN-9098.003.patch, 
> YARN-9098.004.patch
>
>
> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores 
> cgroups data.
> CGroupsLCEResourcesHandler also has a method with the same name, with 
> identical code.
> The parser code should be extracted from these places and be added in a new 
> class as this is a separate responsibility.
> As the output of the file parser is a Map>, it's better 
> to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance.
> ResourceHandlerModule has a method named parseConfiguredCGroupPath, that is 
> responsible for producing the same results (Map>) to 
> store cgroups data, it does not operate on mtab file, but looking at the 
> filesystem for cgroup settings. As the output is the same, CGroupsMountConfig 
> should be used here, too.
> Again, this could should not be part of ResourceHandlerModule as it is a 
> different responsibility.
> One more thing which is strongly related to the methods above is 
> CGroupsHandlerImpl.initializeFromMountConfig: This method processes the 
> result of a parsed mtab file or a parsed cgroups filesystem data and stores 
> file system paths for all available controllers. This method invokes 
> findControllerPathInMountConfig, which is a duplicated in CGroupsHandlerImpl 
> and CGroupsLCEResourcesHandler, so it should be moved to a single place. To 
> store filesystem path and controller mappings, a new domain object could be 
> introduced.
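
To illustrate the kind of extraction described above, here is a minimal, 
self-contained sketch of an mtab parser that produces a controller-to-mount-paths 
map (the class name is hypothetical, and the actual CGroupsMountConfig and parser 
in the patch may be structured differently):
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the extracted mtab-parsing responsibility (hypothetical
// class name). Each mtab line looks like:
//   "none /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0"
public class MtabParserSketch {

  /** Returns a map from mount option (e.g. controller name) to mount paths. */
  public static Map<String, Set<String>> parse(Reader mtab) throws IOException {
    Map<String, Set<String>> result = new HashMap<>();
    BufferedReader reader = new BufferedReader(mtab);
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.trim().split("\\s+");
      if (fields.length < 4 || !"cgroup".equals(fields[2])) {
        continue;  // not a cgroup mount entry
      }
      String mountPoint = fields[1];
      for (String option : fields[3].split(",")) {
        // Mount options include the enabled controllers, e.g. "cpu", "blkio".
        // (A real parser would filter to known controller names; kept simple here.)
        result.computeIfAbsent(option, k -> new HashSet<>()).add(mountPoint);
      }
    }
    return result;
  }
}
{code}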



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule

2018-12-19 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724900#comment-16724900
 ] 

Szilard Nemeth commented on YARN-9098:
--

Hi [~Jim_Brennan]!

 

Thanks for looking at my patch and taking steps on testing!

First of all:  the static import is fixed with the new patch.

Actually, I understand your fear regarding this refactoring but this eliminates 
some code duplication and improves the quality of the code significantly. 

 

About the test errors: 
These are very strange errors. At this level, they seem to be all test 
framework issues, as somehow on your computer, the temp directory storing 
cgroups and the cpu controller underneath have not created, which is very 
strange as the tests are having assertions on the file and directory creations.


I re-built and ran the testcases with these commands again on my computer, 
locally: 
1.
{code:java}
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true{code}

2.
{code:java}
mvn test -pl org.apache.hadoop:hadoop-yarn-server-nodemanager -fae | tee 
~/maventest`date +%Y%m%d`
{code}
 

I don't have test failures on any of the classes where you had, this is the 
excerpt of the output:

 
{noformat}
[INFO] Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.138 s 
- in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestMtabFileParser
...

[INFO] Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.538 s 
- in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsHandlerImpl
 
...
[INFO] Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.127 s 
- in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestCGroupsControllerPaths
{noformat}
 

 

This is the file listing of my target directory: 
{noformat}
┌-( szilardnemeth@snemeth-MBP[10:17:19] <0> @YARN-9098 )--( 
~/development/apache/hadoop )--
└-$ cat 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/025802ba-4abe-4862-942b-beef2d279ca7
none 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cp
 cgroup rw,relatime,cpu 0 0
none 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/cpu
 cgroup rw,relatime,cpu 0 0
none 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir/cgroups/blkio
 cgroup rw,relatime,blkio 0 0{noformat}

I even tried to remove the test directory (with {{rm -rf 
/Users/szilardnemeth/development/apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-dir}})
 and re-execute the tests but they work fine.
Anyway, I added some code that better describes the assertion errors plus added 
some logging of cgroup paths to 2 testcases.

 

Could you please rerun the tests with the new code and do a file listing like I 
did? 
This must be some platform issue I suppose.


The test failures for {{TestCGroupsHandlerImpl}}: 
{{TestCGroupsHandlerImpl#createPremountedCgroups}} calls: 
{noformat}
File cpuCgroup = new File(parentDir, "cpu");
//and later on...
assertTrue("Directory should be created", cpuCgroup.mkdirs());
{noformat}
 

This should create cgroups for cpu, and as you can see, it is even asserted 
properly.

 

Thanks!

 

 

> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> --
>
> Key: YARN-9098
> URL: https://issues.apache.org/jira/browse/YARN-9098
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9098.002.patch, YARN-9098.003.patch, 
> YARN-9098.004.patch
>
>
> Separate mtab file reader code and cgroups file system hierarchy parser code 
> from CGroupsHandlerImpl and ResourceHandlerModule
> CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores 
> cgroups data.
> CGroupsLCEResourcesHandler also has a method with the same name, with 
> identical code.
> The parser code 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to rejoin the election, but it will never retry or crash 
later, so afterwards no zk related logs and the standby RM is forever hang, 
even if the zk connect string hostnames are changed back the orignal ones in 
DNS.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join election* 
(give up join election should only happen on RM decide to crash), otherwise, a 
RM without inside the election can never become active again and start real 
works.

 

{color:#205081}*Caused By:*{color}

It is introduced by YARN-3742

The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
happens, RM should transition to standby, instead of crash.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crash.* (In contrast, before this change, RM makes all to 
crash instead of to standby)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM continue not work, such as stay in standby forever.

And as the author 
[said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author is *too optimistic when implement the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, for *conservative*, we would better *only transition to standby for the 
failures in {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and future 
added failure types, should crash RM, because we *cannot ensure* that they will 
*never* cause RM cannot work in standby state, and the *conservative* way is to 
crash RM. Besides, after crash, the RM's external watchdog service can know 
this and try to repair the RM machine, send alerts, etc.

For more details, please check the patch.

  was:
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS.
 (In reality, we need to replace old/bad zk machines to new/good zk machines, 
so their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException (Here the 
exception is eat and just send event)
  Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to rejoin the election, but it will never retry or crash 
later, so afterwards no zk related logs and the standby RM is forever hang, 
even if the zk connect string hostnames are changed back the orignal ones in 
DNS.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join election* 
(give up join election should only happen on RM decide to crash), otherwise, a 
RM without inside the election can never become active again and start real 
works.

 

{color:#205081}*Caused By:*{color}

It is introduced by YARN-3742

The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
happens, RM should transition to standby, instead of crash.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crash.* (In contrast, before this change, RM makes all to 
crash instead of to standby)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM continue not work, such as stay in standby forever.

And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author is *too optimistic when implement the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, for *conservative*, we would better *only transition to standby for the 
failures in {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and future 
added failure types, should crash RM, because we *cannot ensure* that they will 
*never* cause RM cannot work in standby state, and the *conservative* way is to 
crash RM. Besides, after crash, the RM's external watchdog service can know 
this and try to repair the RM machine, send alerts, etc.

For more details, please check the patch.

  was:
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  

[jira] [Commented] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724895#comment-16724895
 ] 

Yuqi Wang commented on YARN-9151:
-

[~kasha] and [~templedf], could you please look at this issue and fix it?

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS.
>  (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException (Here the 
> exception is eat and just send event)
>   Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to rejoin the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang, even if the zk connect string hostnames are changed back the orignal 
> ones in DNS.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will *never* cause RM cannot work in standby state, and the 
> *conservative* way is to crash RM. Besides, after crash, the RM's external 
> watchdog service can know this and try to repair the RM machine, send alerts, 
> etc.
> For more details, please check the patch.



--
This message was sent by 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines with new/good zk machines, 
so their DNS hostnames will change.)

 

{color:#205081}*Issue Logs:*{color}

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is swallowed and just a transition-to-Standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is swallowed and just a transition-to-Standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to re-join the election, but it will never retry or 
crash later, so afterwards there are no zk related logs and the standby RM 
hangs forever.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up joining the election should only happen when RM decides to 
crash); otherwise, an RM that is not in the election can never become active 
again and start real work.

 

{color:#205081}*Caused By:*{color}

It was introduced by YARN-3742.

What the JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crashing.* (In contrast, before this change, RM crashed on 
all of them instead of transitioning to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM unable to work, for example stuck in standby forever.

And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
any failure types added in the future, should crash RM, because we *cannot 
ensure* that they will never leave RM unable to work in the standby state, so 
the *conservative* way is to crash RM. Besides, after a crash, the RM watchdog 
can detect it and try to repair the RM machine, send alerts, etc.

For more details, please check the patch.

  was:
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

{color:#205081}*Issue Logs:*{color}

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Fix Version/s: 2.9.2

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1, 2.9.2
>
> Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> *Caused By:*
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> *What the Patch's solution:*
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will never cause RM cannot work in standby state, the *conservative* way 
> is to crash RM. Besides, after crash, the RM watchdog can know this and try 
> to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Release Note: Fix standby RM hangs (not retry or crash) forever due to 
forever lost from leader election. And now, RM will only transition to standby 
for known safe fatal events.  (was: Fix standby RM hangs (not retry or crash) 
forever due to forever lost from leader election. And now, RM will only 
transition to standby for known fatal events.)

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> {color:#205081}*Issue Repro Steps:*{color}
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> {color:#205081}*Issue Logs:*{color}
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> {color:#205081}*What the Patch's solution:*{color}
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will never cause RM cannot work in standby state, the *conservative* way 
> is to crash RM. Besides, after crash, the RM watchdog can know 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
{color:#205081}*Issue Summary:*{color}
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

{color:#205081}*Issue Repro Steps:*{color}
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines with new/good zk machines, 
so their DNS hostnames will change.)

 

{color:#205081}*Issue Logs:*{color}

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is swallowed and just a transition-to-Standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is swallowed and just a transition-to-Standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to re-join the election, but it will never retry or 
crash later, so afterwards there are no zk related logs and the standby RM 
hangs forever.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up joining the election should only happen when RM decides to 
crash); otherwise, an RM that is not in the election can never become active 
again and start real work.

 

{color:#205081}*Caused By:*{color}

It was introduced by YARN-3742.

What the JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crashing.* (In contrast, before this change, RM crashed on 
all of them instead of transitioning to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM unable to work, for example stuck in standby forever.

And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

{color:#205081}*What the Patch's solution:*{color}

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
any failure types added in the future, should crash RM, because we *cannot 
ensure* that they will never leave RM unable to work in the standby state, so 
the *conservative* way is to crash RM. Besides, after a crash, the RM watchdog 
can detect it and try to repair the RM machine, send alerts, etc.

For more details, please check the patch.

  was:
*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to 

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Fix Version/s: (was: 2.9.2)

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> *Caused By:*
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> *What the Patch's solution:*
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will never cause RM cannot work in standby state, the *conservative* way 
> is to crash RM. Besides, after crash, the RM watchdog can know this and try 
> to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines with new/good zk machines, 
so their DNS hostnames will change.)

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is swallowed and just a transition-to-Standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is swallowed and just a transition-to-Standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to re-join the election, but it will never retry or 
crash later, so afterwards there are no zk related logs and the standby RM 
hangs forever.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up joining the election should only happen when RM decides to 
crash); otherwise, an RM that is not in the election can never become active 
again and start real work.

 

*Caused By:*

It was introduced by YARN-3742.

What the JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crashing.* (In contrast, before this change, RM crashed on 
all of them instead of transitioning to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will 
leave the standby RM unable to work, for example stuck in standby forever.

And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

*What the Patch's solution:*

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }

And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
any failure types added in the future, should crash RM, because we *cannot 
ensure* that they will never leave RM unable to work in the standby state, so 
the *conservative* way is to crash RM. Besides, after a crash, the RM watchdog 
can detect it and try to repair the RM machine, send alerts, etc.

For more details, please check the patch.

  was:
*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  

[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Attachment: (was: YARN-9151-branch-2.9.2.001.patch)

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> *Caused By:*
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> *What the Patch's solution:*
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will never cause RM cannot work in standby state, the *conservative* way 
> is to crash RM. Besides, after crash, the RM watchdog can know this and try 
> to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Attachment: YARN-9151.001.patch

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> *Caused By:*
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> *What the Patch's solution:*
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will never cause RM cannot work in standby state, the *conservative* way 
> is to crash RM. Besides, after crash, the RM watchdog can know this and try 
> to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Fix Version/s: (was: 3.1.1)

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> *Caused By:*
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> *What the Patch's solution:*
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will never cause RM cannot work in standby state, the *conservative* way 
> is to crash RM. Besides, after crash, the RM watchdog can know this and try 
> to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Attachment: YARN-9151-branch-2.9.2.001.patch

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> *Caused By:*
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> *What the Patch's solution:*
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> future added failure types, should crash RM, because we *cannot ensure* that 
> they will never cause RM cannot work in standby state, the *conservative* way 
> is to crash RM. Besides, after crash, the RM watchdog can know this and try 
> to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Fix Version/s: (was: 2.9.2)
   3.1.1

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join 
> election* (give up join election should only happen on RM decide to crash), 
> otherwise, a RM without inside the election can never become active again and 
> start real works.
>  
> *Caused By:*
> It is introduced by YARN-3742
> The JIRA want to improve is that, when STATE_STORE_OP_FAILED RMFatalEvent 
> happens, RM should transition to standby, instead of crash.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crash.* (In contrast, before this change, RM makes all 
> to crash instead of to standby)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it 
> will leave the standby RM continue not work, such as stay in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author is *too optimistic when implement the patch.*
>  
> *What the Patch's solution:*
> So, for *conservative*, we would better *only transition to standby for the 
> failures in {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> failure types added in the future, should crash RM, because we *cannot ensure* 
> that they will never leave RM unable to work in the standby state; the 
> *conservative* way is to crash RM. Besides, after a crash, the RM watchdog can 
> notice this and try to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Fix Version/s: 2.9.2

> Standby RM hangs (not retry or crash) forever due to forever lost from leader 
> election
> --
>
> Key: YARN-9151
> URL: https://issues.apache.org/jira/browse/YARN-9151
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.2
>Reporter: Yuqi Wang
>Assignee: Yuqi Wang
>Priority: Major
>  Labels: patch
> Fix For: 3.1.1
>
> Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
>  Standby RM hangs (not retry or crash) forever due to forever lost from 
> leader election
>  
> *Issue Repro Steps:*
>  # Start multiple RMs in HA mode
>  # Modify all hostnames in the zk connect string to different values in DNS. 
> (In reality, we need to replace old/bad zk machines to new/good zk machines, 
> so their DNS hostname will be changed.)
>  
> *Issue Logs:*
> The RM is BN4SCH101222318
> You can check the full RM log in attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
>   Start to becomeActive 
> Start RMActiveServices 
> Start CommonNodeLabelsManager failed due to zk connect 
> UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
>   Fail to becomeActive 
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException 
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Already in standby state
>   ReJoin Election
>   Failed to Join Election due to zk connect UnknownHostException
>   (Here the exception is eat and just send transition to Standby event)
>   Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
>   Start StandByTransitionThread
>   Found RMActiveServices's StandByTransitionRunnable object has already run 
> previously, so immediately return
>    
> (The standby RM failed to re-join the election, but it will never retry or 
> crash later, so afterwards no zk related logs and the standby RM is forever 
> hang.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join the 
> election* (giving up on joining the election should only happen when RM 
> decides to crash); otherwise, an RM that is not in the election can never 
> become active again and do real work.
>  
> *Caused By:*
> It was introduced by YARN-3742.
> What that JIRA wanted to improve was that, when a STATE_STORE_OP_FAILED 
> RMFatalEvent happens, RM should transition to standby instead of crashing.
>  *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition 
> to standby, instead of crashing.* (In contrast, before this change, RM crashed 
> on all of them instead of going to standby.)
>  So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the 
> standby RM is left unable to work, e.g. it stays in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent 
> handler to transition to standby as the default reaction, *with shutdown as a 
> special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>  
> *The Patch's solution:*
> So, to be *conservative*, we had better *only transition to standby for the 
> failures in the {color:#14892c}whitelist{color}:*
>  public enum RMFatalEventType {
>  {color:#14892c}// Source <- Store{color}
>  {color:#14892c}STATE_STORE_FENCED,{color}
>  {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
>  EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
>  {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
>  CRITICAL_THREAD_CRASH
>  }
>  And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
> failure types added in the future, should crash RM, because we *cannot ensure* 
> that they will never leave RM unable to work in the standby state; the 
> *conservative* way is to crash RM. Besides, after a crash, the RM watchdog can 
> notice this and try to repair the RM machine, send alerts, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to re-join the election, but it will never retry or 
crash later, so afterwards no zk related logs and the standby RM is forever 
hang.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up on joining the election should only happen when RM decides 
to crash); otherwise, an RM that is not in the election can never become active 
again and do real work.

 

*Caused By:*

It was introduced by YARN-3742.

What that JIRA wanted to improve was that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crashing.* (In contrast, before this change, RM crashed on 
all of them instead of going to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the 
standby RM is left unable to work, e.g. it stays in standby forever.

And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

*The Patch's solution:*

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 {color:#14892c}// Source <- Store{color}
 {color:#14892c}STATE_STORE_FENCED,{color}
 {color:#14892c}STATE_STORE_OP_FAILED,{color}

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}// Source <- Admin Service{color}
 {color:#14892c} TRANSITION_TO_ACTIVE_FAILED,{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }


 And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
failure types added in the future, should crash RM, because we *cannot ensure* 
that they will never leave RM unable to work in the standby state; the 
*conservative* way is to crash RM. Besides, after a crash, the RM watchdog can 
notice this and try to repair the RM machine, send alerts, etc.

  was:
*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due 

[jira] [Created] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)
Yuqi Wang created YARN-9151:
---

 Summary: Standby RM hangs (not retry or crash) forever due to 
forever lost from leader election
 Key: YARN-9151
 URL: https://issues.apache.org/jira/browse/YARN-9151
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.9.2
Reporter: Yuqi Wang
Assignee: Yuqi Wang
 Fix For: 3.1.1
 Attachments: yarn_rm.zip

*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)
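
For the second repro step above, the relevant piece of configuration is the 
ZooKeeper connect string used by the RM. A minimal illustrative sketch follows; 
the hostnames and values are placeholders, not taken from the attached logs (in 
2.9.x the same string can also be supplied via hadoop.zk.address):
{code:xml}
<!-- yarn-site.xml (illustrative values only) -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <!-- The connect string whose hostnames are re-pointed in DNS during the repro -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
{code}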

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to re-join the election, but it will never retry or 
crash later, so afterwards no zk related logs and the standby RM is forever 
hang.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up on joining the election should only happen when RM decides 
to crash); otherwise, an RM that is not in the election can never become active 
again and do real work.

 

*Caused By:*

It was introduced by YARN-3742.

What that JIRA wanted to improve was that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crashing.* (In contrast, before this change, RM crashed on 
all of them instead of going to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the 
standby RM is left unable to work, e.g. it stays in standby forever.

And as the author 
[said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

*The Patch's solution:*

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 *{color:#14892c}// Source <- Store{color}*
 *{color:#14892c}STATE_STORE_FENCED,{color}*
 *{color:#14892c}STATE_STORE_OP_FAILED,{color}*

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}*// Source <- Admin Service*{color}
{color:#14892c} *TRANSITION_TO_ACTIVE_FAILED,*{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }
 And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
failure types added in the future, should crash RM, because we *cannot ensure* 
that they will never leave RM unable to work in the standby state; the 
*conservative* way is to crash RM. Besides, after a crash, the RM watchdog can 
notice this and try to repair the RM machine, send alerts, etc.
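
To make the proposed handling concrete, here is a rough sketch of the intended 
behavior. It is illustrative only and not the code in 
YARN-9151-branch-2.9.2.001.patch; the whitelist field and the helper method 
names used here are placeholders:
{code:java}
// Sketch: transition to standby only for whitelisted fatal events, crash otherwise.
private static final EnumSet<RMFatalEventType> STANDBY_WHITELIST = EnumSet.of(
    RMFatalEventType.STATE_STORE_FENCED,
    RMFatalEventType.STATE_STORE_OP_FAILED,
    RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

private void handleFatalEvent(RMFatalEvent event) {
  if (rmContext.isHAEnabled() && STANDBY_WHITELIST.contains(event.getType())) {
    // Known-recoverable failures: demote to standby and keep the RM alive
    // so it can rejoin the leader election.
    handleTransitionToStandByInNewThread();
  } else {
    // Anything else (EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH, or failure
    // types added in the future): exit, so an external watchdog can repair/alert.
    ExitUtil.terminate(1, "Received " + event.getType() + ", exiting RM");
  }
}
{code}
This mirrors the whitelist above: the green-highlighted entries stay in standby, 
everything else crashes the process.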



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

2018-12-19 Thread Yuqi Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Wang updated YARN-9151:

Description: 
*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

 

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found RMActiveServices's StandByTransitionRunnable object has already run 
previously, so immediately return
   
(The standby RM failed to re-join the election, but it will never retry or 
crash later, so afterwards no zk related logs and the standby RM is forever 
hang.)
{noformat}
So, this should be a bug in RM, because *RM should always try to join the 
election* (giving up on joining the election should only happen when RM decides 
to crash); otherwise, an RM that is not in the election can never become active 
again and do real work.

 

*Caused By:*

It was introduced by YARN-3742.

What that JIRA wanted to improve was that, when a STATE_STORE_OP_FAILED 
RMFatalEvent happens, RM should transition to standby instead of crashing.
 *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to 
standby, instead of crashing.* (In contrast, before this change, RM crashed on 
all of them instead of going to standby.)
 So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the 
standby RM is left unable to work, e.g. it stays in standby forever.

And as the author said:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

*The Patch's solution:*

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 *{color:#14892c}// Source <- Store{color}*
 *{color:#14892c}STATE_STORE_FENCED,{color}*
 *{color:#14892c}STATE_STORE_OP_FAILED,{color}*

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}*// Source <- Admin Service*{color}
 {color:#14892c} *TRANSITION_TO_ACTIVE_FAILED,*{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }
 And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and 
failure types added in the future, should crash RM, because we *cannot ensure* 
that they will never leave RM unable to work in the standby state; the 
*conservative* way is to crash RM. Besides, after a crash, the RM watchdog can 
notice this and try to repair the RM machine, send alerts, etc.

  was:
*Issue Summary:*
 Standby RM hangs (not retry or crash) forever due to forever lost from leader 
election

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Modify all hostnames in the zk connect string to different values in DNS. 
(In reality, we need to replace old/bad zk machines to new/good zk machines, so 
their DNS hostname will be changed.)

 

*Issue Logs:*

The RM is BN4SCH101222318

You can check the full RM log in attachment, yarn_rm.zip.

To make it clear, the whole story is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
Start RMActiveServices 
Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
Stop CommonNodeLabelsManager
Stop RMActiveServices
Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is eat and just send transition to Standby event)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election 

[jira] [Commented] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724867#comment-16724867
 ] 

Hadoop QA commented on YARN-9150:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
42s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
11s{color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  3m 
58s{color} | {color:red} hadoop-yarn in trunk failed. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 49s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase-tests
 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
24s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
58s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  7m 58s{color} 
| {color:red} hadoop-yarn-project_hadoop-yarn generated 46 new + 87 unchanged - 
0 fixed = 133 total (was 87) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 34s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase-tests
 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
42s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
10s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
29s{color} | {color:green} hadoop-yarn-server-timelineservice-hbase-client in 
the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 
33s{color} | {color:green} hadoop-yarn-server-timelineservice-hbase-tests in 
the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | 

[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Sushil Ks (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-9150:

Description: 
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}
     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.
{code:java}
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create
{code}
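
As a hypothetical illustration of what a pluggable backend could look like (the 
class shape and method signature below are assumptions for illustration only; 
see YARN-9150.001.patch for the actual contract introduced):
{code:java}
// Hypothetical example of a non-HBase schema creator plugged in via
// yarn.timeline-service.schema-creator.class.
package org.example.timeline;

import org.apache.hadoop.conf.Configuration;

public class MySchemaCreator {
  /**
   * Create the ATSv2 schema on the configured backend
   * (tables, column families, indexes, ...).
   */
  public void createTimelineSchema(String[] args, Configuration conf) throws Exception {
    // Connect to the backend and create the required schema objects here.
  }
}
{code}
With yarn.timeline-service.schema-creator.class set to 
org.example.timeline.MySchemaCreator, the existing TimelineSchemaCreator 
command above would delegate schema creation to this class instead of the HBase 
implementation.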
 

 

  was:
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}
     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.


{code:java}
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create
{code}
 

 


> Making TimelineSchemaCreator to support different backends for Timeline 
> Schema Creation in ATSv2
> 
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-9150.001.patch
>
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for 
> creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
> the multiple back-ends that ATSv2 can support.
> *Usage:*
>    Add the following property in *yarn-site.xml*
> {code:java}
> <property>
>  <name>yarn.timeline-service.schema-creator.class</name>
>  <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
> </property>
> {code}
>      The Command needed to run the TimelineSchemaCreator need not be changed, 
> i.e. the below existing command can be used irrespective of the backend 
> configured.
> {code:java}
> bin/hadoop 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
> -create
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5168) Add port mapping handling when docker container use bridge network

2018-12-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724815#comment-16724815
 ] 

Hadoop QA commented on YARN-5168:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 13 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
9s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  7m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
23m  1s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  9m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  5m  
3s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 14m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  3m 
 9s{color} | {color:green} root: The patch generated 0 new + 1004 unchanged - 7 
fixed = 1004 total (was 1011) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  7m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 13s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Skipped patched modules with no Java source: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 11m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  5m  
6s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
45s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
32s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 
35s{color} | {color:green} hadoop-yarn-server-common in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 19m 
17s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
27s{color} | {color:green} hadoop-yarn-server-applicationhistoryservice in the 
patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 

[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Sushil Ks (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-9150:

Description: 
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}
     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.


{code:java}
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create
{code}
 

 

  was:
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}

*Running*

     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.
 bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create


> Making TimelineSchemaCreator to support different backends for Timeline 
> Schema Creation in ATSv2
> 
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-9150.001.patch
>
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for 
> creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
> the multiple back-ends that ATSv2 can support.
> *Usage:*
>    Add the following property in *yarn-site.xml*
> {code:java}
> <property>
>  <name>yarn.timeline-service.schema-creator.class</name>
>  <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
> </property>
> {code}
>      The Command needed to run the TimelineSchemaCreator need not be changed, 
> i.e. the below existing command can be used irrespective of the backend 
> configured.
> bin/hadoop 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
> -create
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Sushil Ks (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-9150:

Description: 
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}

*Running*

     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.
 bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create

  was:
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}

*Running*

     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.
bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create


> Making TimelineSchemaCreator to support different backends for Timeline 
> Schema Creation in ATSv2
> 
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-9150.001.patch
>
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for 
> creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
> the multiple back-ends that ATSv2 can support.
> *Usage:*
>    Add the following property in *yarn-site.xml*
> {code:java}
> <property>
>  <name>yarn.timeline-service.schema-creator.class</name>
>  <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
> </property>
> {code}
> *Running*
>      The Command needed to run the TimelineSchemaCreator need not be changed, 
> i.e. the below existing command can be used irrespective of the backend 
> configured.
>  bin/hadoop 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
> -create



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9150) Making TimelineSchemaCreator to support different backends for Timeline Schema Creation in ATSv2

2018-12-19 Thread Sushil Ks (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Ks updated YARN-9150:

Description: 
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}

*Running*

     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.
 bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create

  was:
h3. Currently the TimelineSchemaCreator has a concrete implementation for 
creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
the multiple back-ends that ATSv2 can support.

*Usage:*

   Add the following property in *yarn-site.xml*
{code:java}
<property>
 <name>yarn.timeline-service.schema-creator.class</name>
 <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
</property>
{code}

*Running*

     The Command needed to run the TimelineSchemaCreator need not be changed, 
i.e. the below existing command can be used irrespective of the backend 
configured.
 bin/hadoop 
org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
-create


> Making TimelineSchemaCreator to support different backends for Timeline 
> Schema Creation in ATSv2
> 
>
> Key: YARN-9150
> URL: https://issues.apache.org/jira/browse/YARN-9150
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2
>Reporter: Sushil Ks
>Assignee: Sushil Ks
>Priority: Major
> Attachments: YARN-9150.001.patch
>
>
> h3. Currently the TimelineSchemaCreator has a concrete implementation for 
> creating Timeline Schemas only for HBase. Hence, this JIRA is for supporting 
> the multiple back-ends that ATSv2 can support.
> *Usage:*
>    Add the following property in *yarn-site.xml*
> {code:java}
> <property>
>  <name>yarn.timeline-service.schema-creator.class</name>
>  <value>YOUR_TIMELINE_SCHEMA_CREATOR_CLASS</value>
> </property>
> {code}
> *Running*
>      The Command needed to run the TimelineSchemaCreator need not be changed, 
> i.e. the below existing command can be used irrespective of the backend 
> configured.
>  bin/hadoop 
> org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator 
> -create



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9149) yarn container -status misses logUrl when integrated with ATSv2

2018-12-19 Thread Rohith Sharma K S (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S reassigned YARN-9149:
---

Assignee: Rohith Sharma K S

> yarn container -status misses logUrl when integrated with ATSv2
> ---
>
> Key: YARN-9149
> URL: https://issues.apache.org/jira/browse/YARN-9149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Major
>
> Post YARN-8303, the yarn client can be integrated with ATSv2. But the log url 
> and the start/end times are printed wrong!
> {code}
> Container Report :
>   Container-Id : container_1545035586969_0001_01_01
>   Start-Time : 0
>   Finish-Time : 0
>   State : COMPLETE
>   Execution-Type : GUARANTEED
>   LOG-URL : null
>   Host : localhost:25006
>   NodeHttpAddress : localhost:25008
>   Diagnostics :
> {code}
> # TimelineEntityV2Converter#convertToContainerReport sets logUrl to *null*. 
> It needs to be set to a proper log url based on yarn.log.server.web-service.url
> # TimelineEntityV2Converter#convertToContainerReport parses the start/end time 
> wrongly. The comparison should happen with the entityType, but the code below 
> is using the entityId
> {code}
> if (events != null) {
>   for (TimelineEvent event : events) {
> if (event.getId().equals(
> ContainerMetricsConstants.CREATED_IN_RM_EVENT_TYPE)) {
>   createdTime = event.getTimestamp();
> } else if (event.getId().equals(
> ContainerMetricsConstants.FINISHED_IN_RM_EVENT_TYPE)) {
>   finishedTime = event.getTimestamp();
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org