[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.1 First patch based on LabelBasedScheduling design document Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Labels: (was: patch) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077435#comment-14077435 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1 against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4467//console This message is automatically generated. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077440#comment-14077440 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1 against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4468//console This message is automatically generated. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077443#comment-14077443 ] Hadoop QA commented on YARN-611: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658363/YARN-611.4.rebase.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4466//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4466//console This message is automatically generated. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch, YARN-611.4.rebase.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time used to decide when an AM is well behaved and it's safe to reset its failure count back to zero. Every time an AM fails, the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is >= max-retries, then the job should fail. If the AM has never failed, the retry count is < max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. 
Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail, if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
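For illustration only, here is a minimal sketch of the window check proposed above; it is not the code in YARN-611.4.rebase.patch, and the class and method names (FailureWindowTracker, recordFailureAndCheck) are invented for the example, assuming "fail the app only when max-retries or more failures land inside the window":
{code}
import java.util.ArrayDeque;
import java.util.Deque;

public class FailureWindowTracker {
  private final int maxRetries;            // yarn.resourcemanager.am.max-retries
  private final long retryCountWindowMs;   // proposed yarn.resourcemanager.am.retry-count-window-ms
  private final Deque<Long> failureTimes = new ArrayDeque<>();

  public FailureWindowTracker(int maxRetries, long retryCountWindowMs) {
    this.maxRetries = maxRetries;
    this.retryCountWindowMs = retryCountWindowMs;
  }

  /** Record an AM failure; return true if the application should be failed. */
  public synchronized boolean recordFailureAndCheck(long nowMs) {
    failureTimes.addLast(nowMs);
    // Forget failures that happened outside the window, i.e. "reset" the count.
    while (!failureTimes.isEmpty()
        && nowMs - failureTimes.peekFirst() > retryCountWindowMs) {
      failureTimes.removeFirst();
    }
    // Fail only if maxRetries (or more) failures occurred within the window.
    return failureTimes.size() >= maxRetries;
  }
}
{code}
In this shape, one tracker per application would be consulted from the attempt-failed transition, and a window of Long.MAX_VALUE reproduces today's never-reset behavior.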
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077453#comment-14077453 ] Gera Shegalov commented on YARN-796: Hi [~yufeldman], thanks for posting the patch. Please rebase it since it no longer applies. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077458#comment-14077458 ] Yuliya Feldman commented on YARN-796: - Yes, noticed - will repost in a moment Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.2 Patch to comply with svn Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.1) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.2 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2215) Add preemption info to REST/CLI
[ https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077481#comment-14077481 ] Kenji Kikushima commented on YARN-2215: --- Thanks for the comments, [~leftnoteasy]. I will also make a patch for CLI support, but it doesn't depend on REST's dao. To make this clearer, may I divide this JIRA into REST support and CLI support? Add preemption info to REST/CLI --- Key: YARN-2215 URL: https://issues.apache.org/jira/browse/YARN-2215 Project: Hadoop YARN Issue Type: Bug Components: client, resourcemanager Reporter: Wangda Tan Assignee: Kenji Kikushima Attachments: YARN-2215.patch As discussed in YARN-2181, we'd better add preemption info to the RM RESTful API/CLI so that administrators/users can better understand preemption that happened on an app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.2) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.3 Rebased from trunk Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Attachment: YARN-2368.patch ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 -- This message was sent by Atlassian JIRA (v6.2#6252)
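As a side note on the 1MB limit discussed in this issue: jute.maxbuffer is a ZooKeeper system property that has to be raised consistently on both the ZooKeeper servers and the ResourceManager's client JVM (e.g. via -Djute.maxbuffer=...) if larger znodes are to be allowed. The sketch below is illustrative only - a hypothetical size guard before a state-store write - and is not necessarily what YARN-2368.patch does:
{code}
// Hypothetical guard, for illustration only: refuse to write znode data that
// would exceed ZooKeeper's default 1 MB jute.maxbuffer limit, so the state
// store fails with a clear error instead of a ConnectionLoss.
public final class ZnodeSizeGuard {
  // Default ZooKeeper limit is 1 MB; leave a little headroom for framing.
  private static final int DEFAULT_MAX_ZNODE_BYTES = 1024 * 1024 - 64 * 1024;

  private ZnodeSizeGuard() {
  }

  public static void checkSize(String path, byte[] data) {
    if (data != null && data.length > DEFAULT_MAX_ZNODE_BYTES) {
      throw new IllegalArgumentException("Data for znode " + path + " is "
          + data.length + " bytes, which exceeds the " + DEFAULT_MAX_ZNODE_BYTES
          + " byte limit; refusing the state-store write");
    }
  }
}
{code}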
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077580#comment-14077580 ] Junping Du commented on YARN-2209: -- Thanks [~zjshen] for the additional details! I had similar comments above, but [~jianhe] mentioned that RESYNC is only used for the RM restart work, which hasn't been released as a completed feature. However, I checked our previous releases: since 2.2 (maybe earlier), AM_RESYNC and AM_SHUTDOWN have already been part of a public API that customers' applications could use. In this case, our changes here could break those applications - previously an AM would see Resource Manager doesn't recognize AttemptId: ... when the RM restarts (even without work-preserving restart), but now it would see something like Could not contact RM after ... milliseconds. which sounds misleading. Maybe we should consider a more compatible approach, i.e. add a new API to ApplicationMasterProtocol that throws exceptions instead of returning AMCommand, while the old API remains supported for backward compatibility. Thoughts? Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that the application should re-register on RM restart. We should do the same for the AMS#allocate call as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077593#comment-14077593 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658377/YARN-796.patch.3 against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/4470//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 4 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4470//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4470//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4470//console This message is automatically generated. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.3 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077616#comment-14077616 ] Hadoop QA commented on YARN-2368: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658387/YARN-2368.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4471//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4471//console This message is automatically generated. ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile, ZooKeeps logs as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Meanwhile, ZooKeeps log shows as the following: 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 ... ... 2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 was: Both ResouceManager throws out STATE_STORE_OP_FAILED events and failed finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default configuration of ZooKeeper server and client in 'jute.maxbuffer'. 
ResourceManager log shows as the following: 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at
[jira] [Created] (YARN-2369) Environment variable handling assumes values should be appended
Jason Lowe created YARN-2369: Summary: Environment variable handling assumes values should be appended Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe When processing environment variables for a container context the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc. but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077729#comment-14077729 ] Jason Lowe commented on YARN-2369: -- The code in question is in org.apache.hadoop.yarn.util.Apps#addToEnvironment:
{code}
public static void addToEnvironment(
    Map<String, String> environment,
    String variable, String value, String classPathSeparator) {
  String val = environment.get(variable);
  if (val == null) {
    val = value;
  } else {
    val = val + classPathSeparator + value;
  }
  environment.put(StringInterner.weakIntern(variable),
      StringInterner.weakIntern(val));
}
{code}
This has very surprising results for any variable that isn't path-like. For example, we ran across a MapReduce job that had something like this in its environment settings: yarn.app.mapreduce.am.env='JAVA_HOME=/inst/jdk,JAVA_HOME=/inst/jdk' Rather than ending up with JAVA_HOME=/inst/jdk as one would expect, JAVA_HOME instead was set to /inst/jdk:/inst/jdk, which completely broke the job. It seems to me that we should either use a whitelist of variables that support appending or never append settings. For the latter case, if users desire values to be appended, they can ask for it explicitly in their variable settings, using one of these forms depending upon whether they want client-side or container-side environment variable expansion:
{noformat}
PATH='$PATH:/my/extra/path'
PATH='{{PATH}}:/my/extra/path'
{noformat}
Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe When processing environment variables for a container context, the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc., but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.2#6252)
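To make the first option concrete, here is a minimal sketch of what a whitelist-based merge could look like; it is illustrative only, and the class name and whitelist contents are assumptions for the example, not Hadoop code. With this behavior the JAVA_HOME example above would end up as /inst/jdk.
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class EnvMerge {
  // Assumed whitelist of variables with path-like (appendable) semantics.
  private static final Set<String> PATH_LIKE = new HashSet<>(Arrays.asList(
      "PATH", "CLASSPATH", "LD_LIBRARY_PATH", "HADOOP_CLASSPATH"));

  private EnvMerge() {
  }

  public static void addToEnvironment(Map<String, String> env,
      String variable, String value, String separator) {
    String existing = env.get(variable);
    if (existing == null || !PATH_LIKE.contains(variable)) {
      // Overwrite: the last setting wins for ordinary variables like JAVA_HOME.
      env.put(variable, value);
    } else {
      // Append only for variables with path-like semantics.
      env.put(variable, existing + separator + value);
    }
  }
}
{code}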
[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077736#comment-14077736 ] Allen Wittenauer commented on YARN-2369: Post-HADOOP-9902, I wonder if it would be worthwhile to have this configured at the shell level, i.e., have something that both the Java code and the shell could read that would list the semantics of each known/important shell var. This way both could be smarter about overwrite vs. append vs. dedupe. Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe When processing environment variables for a container context the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc. but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2370) Fix comment more accurate in AppSchedulingInfo.java
Wenwu Peng created YARN-2370: Summary: Fix comment more accurate in AppSchedulingInfo.java Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial In the allocateOffSwitch method of AppSchedulingInfo.java, only the OffRack request is updated, so the comment should be "Update cloned OffRack requests for recovery", not "Update cloned RackLocal and OffRack requests for recovery":
{code}
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077797#comment-14077797 ] Zhijie Shen commented on YARN-2209: --- [~djp], thanks for sharing your idea. bq. However, I checked our previous releases: since 2.2 (maybe earlier), AM_RESYNC and AM_SHUTDOWN have already been part of a public API that customers' applications could use. I think AMCommand has been in the codebase since 2.1. I think [~jianhe] meant that the new logic for the RESYNC case was committed only recently. bq. i.e. add a new API to ApplicationMasterProtocol that throws exceptions instead of returning AMCommand, while the old API remains supported for backward compatibility. IMHO, that sounds like an overcorrection for code refactoring work. I think the essential problem here is whether throwing a new sub-exception that may not have been handled before is an acceptable incompatible change, and therefore whether it is worth trading that for code refactoring. Thoughts? Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that the application should re-register on RM restart. We should do the same for the AMS#allocate call as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2360: - Attachment: YARN-2360-v2.txt Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077982#comment-14077982 ] Hadoop QA commented on YARN-2360: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658443/YARN-2360-v2.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4472//console This message is automatically generated. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078046#comment-14078046 ] Jian He commented on YARN-2209: ---
{code}
try {
  ams.allocate(...);
} catch (Exception e) {
  ams.finishApplicationMaster(...);
}
if (response is shutdown/resync) {
  // cleanup and reboot
  ...
}
{code}
The example you mentioned here will continue to work because ams.finishApplicationMaster won't be able to go through. Thus, the AM container eventually gets killed and the RM will still retry this application. So the main point is: regardless of how an application was previously handling the AMCommand, it should continue to work with this change. Existing YARN applications will not break because of this. In fact, I think AMCommand#shutdown itself is a nondeterministic command, because the AM may get killed before it can do anything to process this command. Applications should not rely on this command. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that the application should re-register on RM restart. We should do the same for the AMS#allocate call as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
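For illustration of what the exception-based contract looks like from the AM side, a minimal sketch follows. It assumes the resync case surfaces as ApplicationMasterNotRegisteredException (introduced by YARN-1365); the ResyncAwareHeartbeat class and its reRegister() helper are placeholders invented here, not code from the attached patches.
{code}
import java.io.IOException;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.exceptions.ApplicationMasterNotRegisteredException;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class ResyncAwareHeartbeat {
  private final AMRMClient<AMRMClient.ContainerRequest> amClient;

  public ResyncAwareHeartbeat(AMRMClient<AMRMClient.ContainerRequest> amClient) {
    this.amClient = amClient;
  }

  /** One heartbeat of the AM's allocate loop. */
  public void heartbeat(float progress) throws IOException, YarnException {
    try {
      AllocateResponse response = amClient.allocate(progress);
      // ... hand allocated containers / completed statuses to the application ...
    } catch (ApplicationMasterNotRegisteredException e) {
      // RM was restarted: instead of reacting to AMCommand.RESYNC, the AM now
      // sees this exception and should re-register and resend its requests.
      reRegister();
    }
  }

  private void reRegister() {
    // Placeholder: call registerApplicationMaster(...) again and replay any
    // outstanding ContainerRequests.
  }
}
{code}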
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078051#comment-14078051 ] Karthik Kambatla commented on YARN-2328: Had an offline discussion with Sandy. The approach of having a single background-tasks thread would adversely affect continuous scheduling, as it would be gated on a sleep and an often-longer update. I will go ahead and commit yarn-2328-2.patch, which Sandy already +1ed. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
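For readers following along, a minimal sketch (assumed names, not the actual FairScheduler code) of the pattern being committed: a dedicated thread per background task that extends Thread and exits cleanly when interrupted, so stop() can interrupt and join it deterministically:
{code}
// Each background task gets its own thread, so continuous scheduling is not
// gated behind the update loop's sleep.
class UpdateThread extends Thread {
  private final long updateIntervalMs;

  UpdateThread(long updateIntervalMs) {
    this.updateIntervalMs = updateIntervalMs;
    setName("FairSchedulerUpdateThread");  // assumed name, for illustration
    setDaemon(true);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(updateIntervalMs);
        update();  // placeholder for the periodic scheduler work
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // restore flag and fall out of loop
      }
    }
  }

  private void update() { /* recompute fair shares, preemption needs, etc. */ }
}
{code}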
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078048#comment-14078048 ] Craig Welch commented on YARN-1994: --- It used to be the case that the server's InetSocketAddress would be the hostname, but now it can be 0.0.0.0 in bind-host cases, so it can no longer be obtained the way it used to be before the change. There might be a simpler way to do it than the one here, but what is here does appear to work, and I think it's as good an approach as any - this service is not quite the same as the others in terms of how this part of the code works. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2328. Resolution: Fixed Committed yarn-2328-2.patch to trunk and branch-2. Thanks for the review and offline discussions, Sandy. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078066#comment-14078066 ] Craig Welch commented on YARN-1994: --- On further investigation, the getCanonicalHostname() code a little further down is what is responsible for making this section work - and it should work with any valid bind-host configuration (as it needs to be something on the host which is a valid listening specification, or it will have already failed during the server setup). I think this change is also unnecessary; I'm going to take it out and test to verify. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
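As a rough illustration of the behavior being relied on above (a sketch under assumed names, not the actual Hadoop code): when the bound address is the wildcard, the advertised name falls back to the host's canonical name, which is valid for any bind-host configuration:
{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Sketch: derive a client-usable host name from a (possibly wildcard) bind.
final class HostNames {
  static String advertisedHost(InetSocketAddress bound) throws UnknownHostException {
    InetAddress addr = bound.getAddress();
    if (addr != null && addr.isAnyLocalAddress()) {
      // Bound to 0.0.0.0: fall back to the machine's canonical host name,
      // since the listen address itself is useless to clients
      return InetAddress.getLocalHost().getCanonicalHostName();
    }
    return bound.getHostName();
  }
}
{code}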
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078072#comment-14078072 ] Hudson commented on YARN-2328: -- FAILURE: Integrated in Hadoop-trunk-Commit #5982 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5982/]) YARN-2328. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1614432) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1994: -- Attachment: YARN-1994.12.patch Yup, that change is not needed, nice catch Arpit. Attached is .12 without it. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078195#comment-14078195 ] Arpit Agarwal commented on YARN-1994: - +1 for the latest patch, thanks for all the patch iterations Craig! :-) Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078240#comment-14078240 ] Milan Potocnik commented on YARN-1994: -- Hi [~cwelch], [~arpitagarwal], I think some clarification is needed here. The initial reason we wanted to introduce the _BIND_HOST options was to provide deterministic behavior when clients try to connect to a service endpoint which is listening on all interfaces (0.0.0.0). In short, _BIND_HOST is what services use to bind, and _ADDRESS is what clients should use to connect. This way, everything is deterministic. With the default implementation, calls to conf.updateConnectAddress for the 0.0.0.0 address would eventually call InetSocketAddress.getHostName(). In multi-NIC environments, this can introduce non-deterministic behavior. Imagine you have DNS entries for each of the network interfaces, and although you bind your service endpoint to all of them, you want users to use a specific one (for instance, InfiniBand for better performance). InetSocketAddress.getHostName() will return just the machine's hostname, which will usually resolve to some arbitrary network interface of the service when the client resolves it. Although the service binds to 0.0.0.0, some interfaces might be blocked by a firewall. This is why, besides RPCUtil.getSocketAddress, we also need RPCUtil.updateConnectAddr to explicitly specify the connect address which clients should use, i.e. a DNS entry pointing to a specific interface. There are also two cases in the code where the current implementation does not work in multi-NIC environments, which we fixed: MRClientService and TaskAttemptListenerImpl, where we had to propagate the NM hostname through the context, which is set in ContainerManagerImpl via NodeId from YarnConfiguration.NM_ADDRESS (the logic Arpit mentioned in the comment). Please have a look at patch version 5 for easier understanding. Hope this clarifies the initial idea. Thanks, Milan Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078272#comment-14078272 ] Craig Welch commented on YARN-1994: --- Milan, at present multi-homing only works if all interfaces on a box have the same name; which address a client will connect to has to be managed by controlling what that name resolves to for a particular class of clients. This is necessary because there are cases where links and redirects are generated based on names, and for this to operate for all clients on all networks, the names for the Hadoop hosts must be the same everywhere. For it to work in any other way would require logic to use a particular name depending on the source network of the client when generating links, and that is not in the current scope (there would be other complexity around managing multiple names for the same host/service as well, which would be problematic). Since the only way for multi-homing to work properly at this point is for the host to have the same name on all networks it can be accessed from, the additional logic is unnecessary - when properly configured, it will always return the same name. The same is true for the container manager impl, the mrclient service, task attempt, etc. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078307#comment-14078307 ] Hadoop QA commented on YARN-1994: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658462/YARN-1994.12.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4473//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4473//console This message is automatically generated. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078337#comment-14078337 ] Craig Welch commented on YARN-1994: --- [~mipoto] Trying to understand the scenario here - is it the case that you have a host with multiple interfaces, where some resolve on the host to a different name from the actual host name, but for some clients on other networks it resolves to the same name as the box (so, to a different name than the box sees for its own interface)? Do I understand properly that the basic issue is that there is an interface on the box with a name different from what you want to use based on the address, but you do somehow want to bind to that name and will be able to use it from a client somewhere? For that to happen, the client would need to resolve the name differently and/or have a different yarn config with a different address in it; is that what is happening? Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2367) Make ResourceCalculator configurable for FairScheduler and FifoScheduler like CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-2367. -- Resolution: Not a Problem Hi Swapnil, The Fair Scheduler supports this through a different interface: scheduling policies can be configured at any queue level in the hierarchy. In general, the FIFO scheduler lacks most of the advanced functionality of the Fair and Capacity schedulers. My opinion is that achieving parity is a non-goal. If you think this shouldn't be the case, feel free to reopen this JIRA under a name like "Support multi-resource scheduling in the FIFO scheduler" and we can discuss whether that's worth embarking on. Make ResourceCalculator configurable for FairScheduler and FifoScheduler like CapacityScheduler --- Key: YARN-2367 URL: https://issues.apache.org/jira/browse/YARN-2367 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.2.0, 2.3.0, 2.4.1 Reporter: Swapnil Daingade Priority: Minor The ResourceCalculator used by CapacityScheduler is read from the configuration entry yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml. This allows for custom implementations that implement the ResourceCalculator interface to be plugged in. It would be nice to have the same functionality in FairScheduler and FifoScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
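For reference, the per-queue policies Sandy mentions are set in the Fair Scheduler allocation file; a small illustrative fragment (queue names invented here) might look like:
{code}
<allocations>
  <queue name="analytics">
    <!-- DRF gives multi-resource (memory + CPU) fairness for this queue -->
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <queue name="batch">
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>
{code}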
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078412#comment-14078412 ] Milan Potocnik commented on YARN-1994: -- [~cwelch] I'll try to explain one of the use cases. Let's say we have the following interfaces in our network: - 1 Ethernet, public network - 2 IB, private network. Please note that on Windows, IB does not support teaming. On the DNS server, the entry for a machine's hostname can resolve to any of the three interfaces (for each 'hostname' entry, three IP addresses). We also add a special DNS entry for each machine that resolves only to the two IB interfaces, let's say in the form 'hostname-IB'. Use case 1: We want internal communication in the cluster to always use IB. We also want to be fault tolerant if one of the IB interfaces fails (remember, no teaming on Windows). In order to bind to both IB interfaces, we must set the bind address to 0.0.0.0. When this is set, clients will currently get the hostname when connecting, which in some cases (the DNS server usually returns IPs round-robin) will resolve to the Ethernet IP address, which could be blocked by a firewall or might degrade performance of internal communication. By setting _BIND_HOST to 0.0.0.0 and _ADDRESS to 'hostname-IB' we avoid the non-determinism of InetSocketAddress.getHostName(). For outside clients we can also control connectivity by making sure they connect via the public network, but this is a simpler problem, since they would use a different DNS server. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
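Concretely, for the RM endpoint this use case maps onto a configuration like the following (an illustrative sketch: 8032 is the default RM client port, and 'hostname-IB' stands for the IB-only DNS entry described above):
{code}
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>  <!-- server listens on all interfaces, incl. both IB -->
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hostname-IB:8032</value>  <!-- clients connect via the IB-only DNS entry -->
</property>
{code}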
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.4.patch ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078625#comment-14078625 ] Tsuyoshi OZAWA commented on YARN-2328: -- {quote} Had an offline discussion with Sandy. This approach of having a single background-tasks-thread would adversely affect continuous scheduling as that would be gated on a sleep and often-longer update. {quote} Checked preview patch. It makes sense to me. Thanks for taking this issue, Karthik. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078648#comment-14078648 ] Jian He commented on YARN-2354: --- Hi Li, the test failures are different from the previous ones; can you make sure these failures are not related to this patch? Thanks! DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue:
{code}
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainersToRequest);
{code}
numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
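For clarity, the fix the description implies touches only the last line of that snippet (a sketch; the committed patch may differ):
{code}
// Count containers the previous AM attempt already requested, so a
// restarted AM does not re-request them and over-allocate:
numRequestedContainers.set(numTotalContainers);
{code}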
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: (was: YARN-796.patch.3) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch4 Fixing failed Test, FindBugs and JavaDocs Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2354: Attachment: YARN-2354-072914.patch Uploading the same patch one more time to see if the unknown host exception reappears. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue:
{code}
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainersToRequest);
{code}
numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078685#comment-14078685 ] Craig Welch commented on YARN-1994: --- Hmm, I think I understand the scenario. As I understand it, this will only be a problem if you want to refer to the host by a value other than the main hostname, e.g. something other than what would be returned from InetAddress.getLocalHost().getHostName() - as that is what it will come back with in the bind-host 0.0.0.0 case. (It's not that the results are indeterminate - it returns the primary host name; it's that you want to use something other than the primary host name.) There are many places in the code where this convention is used to determine the name of the host - it's an implicit convention at least, and I'm concerned that there will be cases where this mismatch causes issues, as evidenced by all of the various things which needed to be tracked down, etc. Also, this is only realistic for single-address services (or rather for cases where addresses are enumerated, including single addresses or an HA resource manager, etc.); others will not have an address in a centralized configuration to override them. Instead of this, could you use the primary hostname and control what address it uses through name resolution? Inside the cluster the hostname should resolve to only the InfiniBand addresses; that way only those (the two IB addresses) will be used. From external networks you can set the resolution to the Ethernet address as you mentioned above. This way, the host is always referred to by the same name, clients are always able to get to it over the desired interface, and no changes to the connect address logic are required. Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.12.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2328) FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped
[ https://issues.apache.org/jira/browse/YARN-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078708#comment-14078708 ] Hudson commented on YARN-2328: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1820 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1820/]) YARN-2328. FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped. (kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1614432) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java FairScheduler: Verify update and continuous scheduling threads are stopped when the scheduler is stopped Key: YARN-2328 URL: https://issues.apache.org/jira/browse/YARN-2328 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2328-1.patch, yarn-2328-2.patch, yarn-2328-2.patch, yarn-2328-preview.patch FairScheduler threads can use a little cleanup and tests. To begin with, the update and continuous-scheduling threads should extend Thread and handle being interrupted. We should have tests for starting and stopping them as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078714#comment-14078714 ] Hadoop QA commented on YARN-2212: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658522/YARN-2212.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4474//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4474//console This message is automatically generated. ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch, YARN-2212.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078727#comment-14078727 ] Junping Du commented on YARN-2209: -- bq. I think the essential problem here is whether throwing a new sub-exception which may not have been handled before is an acceptable incompatible change, and therefore whether it is worth trading that for code refactoring. Thoughts? I think it depends on whether any behavior gets changed (code functionality, logged exceptions, etc.) on the old client side. If the old client isn't aware of this new sub-exception, it still catches it as the parent exception and runs the same logic, which seems fine. Unfortunately, however, I'm concerned the case here doesn't belong to that category. Let's assume a user's AM copies code from RMContainerAllocator; it treats YarnException and RESYNC/SHUTDOWN in the response differently (with different exceptions to indicate different reasons). Now, the same condition that previously caused RESYNC/SHUTDOWN gets thrown as a YarnException (though a new sub-exception) by the new RM, and will be handled the same as other causes. So the old application can only see one exception, which is inconsistent behavior. bq. So the main point is: regardless of how an application was handling the AMCommand earlier, it should continue to work with this change. Existing YARN applications will not break because of this. IMO, breaking an application doesn't mean breaking the app's functionality only; an application behaving inconsistently on the same exception belongs to this case as well. In general, we may think the latter doesn't sound as serious as the former, but it affects the user's experience (especially if they have met the exception before). It may be worth it for a serious bug fix or a significant feature, but we should have more justification for code-refactoring work. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that an application should re-register on RM restart. We should do the same for the AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078740#comment-14078740 ] Hadoop QA commented on YARN-2354: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658544/YARN-2354-072914.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4476//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4476//console This message is automatically generated. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue:
{code}
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainersToRequest);
{code}
numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leitao Guo updated YARN-2368: - Description: Both ResourceManagers threw out STATE_STORE_OP_FAILED events and finally failed. The ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default limit of the ZooKeeper server and client, set via 'jute.maxbuffer'. The ResourceManager (IP addr: 10.153.80.8) log shows the following:
{code}
2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
{code}
Meanwhile, the ZooKeeper log shows the following:
{code}
2014-07-25 22:10:09,728 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x247684586e70006
2014-07-25 22:10:09,730 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth packet /10.153.80.8:58890
2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success /10.153.80.8:58890
2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747
2014-07-25 22:10:09,743 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.153.80.8:58890 which had sessionid 0x247684586e70006
... ...
2014-07-25 22:33:10,966 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747
{code}
was: Both ResourceManagers threw out STATE_STORE_OP_FAILED events and finally failed. The ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default limit of the ZooKeeper server and client, set via 'jute.maxbuffer'. The ResourceManager log shows the following: 2014-07-25 22:33:11,078 INFO
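For anyone hitting this, the 1MB ceiling is ZooKeeper's jute.maxbuffer system property, which would have to be raised consistently on both the ZooKeeper servers and the RM's ZooKeeper client JVM; an illustrative (not recommended-as-is) override:
{code}
# ZooKeeper server side, e.g. via zookeeper-env.sh:
export JVMFLAGS="-Djute.maxbuffer=4194304"

# ResourceManager (ZK client) side, e.g. in yarn-env.sh:
export YARN_RESOURCEMANAGER_OPTS="-Djute.maxbuffer=4194304"
{code}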
[jira] [Updated] (YARN-2370) Fix comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2370: - Attachment: YARN-2370.0.patch This is only a minor update. Thanks in advance for the review. Fix comment more accurate in AppSchedulingInfo.java --- Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch The allocateOffSwitch method of AppSchedulingInfo.java only updates the OffRack request, so the comment should be "Update cloned OffRack requests for recovery", not "Update cloned RackLocal and OffRack requests for recovery":
{code}
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078804#comment-14078804 ] Tsuyoshi OZAWA commented on YARN-2368: -- Thanks for your contribution, [~breno.leitao]! Could you explain the conditions under which you faced this problem? If we face this problem very often, it's a critical problem of ZKRMStateStore. However, the data stored in ZKRMStateStore is basically small. Therefore, I think it's strange that this kind of problem appears. Additionally, if the max data size is fixed, we should set the default value so that we do not face this problem. ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResourceManagers threw out STATE_STORE_OP_FAILED events and finally failed. The ZooKeeper log shows that ZKRMStateStore tries to update a znode larger than 1MB, which is the default limit of the ZooKeeper server and client, set via 'jute.maxbuffer'. The ResourceManager (IP addr: 10.153.80.8) log shows the following: {code} 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} Meanwhile, the ZooKeeper log shows the following: {code} 2014-07-25 22:10:09,728 [myid:1] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x247684586e70006 at /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x247684586e70006 2014-07-25 22:10:09,730 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth packet /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success /10.153.80.8:58890 2014-07-25 22:10:09,742 [myid:1] - WARN
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078807#comment-14078807 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658538/YARN-796.patch4 against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.TestMRWithDistributedCache org.apache.hadoop.mapred.TestJobClientGetJob org.apache.hadoop.mapred.TestLocalModeWithNewApis org.apache.hadoop.mapreduce.TestMapReduce org.apache.hadoop.mapreduce.lib.input.TestLineRecordReaderJobs org.apache.hadoop.mapred.jobcontrol.TestLocalJobControl org.apache.hadoop.mapred.TestJobCounters org.apache.hadoop.mapred.TestLocalMRNotification org.apache.hadoop.mapred.lib.TestDelegatingInputFormat org.apache.hadoop.mapred.TestReduceFetch org.apache.hadoop.mapreduce.TestMapReduceLazyOutput org.apache.hadoop.mapreduce.lib.join.TestJoinProperties org.apache.hadoop.mapred.lib.TestMultithreadedMapRunner org.apache.hadoop.mapred.TestClusterMRNotification org.apache.hadoop.mapreduce.v2.TestMRAppWithCombiner org.apache.hadoop.mapreduce.lib.chain.TestSingleElementChain org.apache.hadoop.mapreduce.TestMapperReducerCleanup org.apache.hadoop.mapreduce.security.TestBinaryTokenFile org.apache.hadoop.mapreduce.v2.TestMRJobsWithProfiler org.apache.hadoop.fs.TestFileSystem org.apache.hadoop.mapreduce.TestLargeSort org.apache.hadoop.mapred.join.TestDatamerge org.apache.hadoop.mapreduce.lib.input.TestMultipleInputs org.apache.hadoop.mapred.TestLazyOutput org.apache.hadoop.mapred.TestTaskCommit org.apache.hadoop.mapreduce.TestMRJobClient org.apache.hadoop.mapreduce.security.TestMRCredentials org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers org.apache.hadoop.mapred.lib.TestChainMapReduce org.apache.hadoop.mapreduce.lib.fieldsel.TestMRFieldSelection org.apache.hadoop.mapreduce.lib.partition.TestMRKeyFieldBasedComparator org.apache.hadoop.mapreduce.lib.db.TestDataDrivenDBInputFormat org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath org.apache.hadoop.mapreduce.v2.TestMRJobs org.apache.hadoop.mapred.TestMapRed org.apache.hadoop.mapred.lib.TestKeyFieldBasedComparator 
org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat org.apache.hadoop.mapreduce.v2.TestNonExistentJob org.apache.hadoop.mapreduce.lib.input.TestDelegatingInputFormat org.apache.hadoop.mapred.TestMiniMRChildTask org.apache.hadoop.fs.slive.TestSlive org.apache.hadoop.mapred.TestComparators org.apache.hadoop.mapreduce.v2.TestUberAM org.apache.hadoop.mapred.TestMiniMRClasspath org.apache.hadoop.mapred.TestMapOutputType org.apache.hadoop.mapreduce.lib.output.TestJobOutputCommitter
[jira] [Commented] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078805#comment-14078805 ] Tsuyoshi OZAWA commented on YARN-2368: -- [~breno.leitao], oops, sorry for mentioning you by mistake. I meant to mention [~guoleitao]. ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB - Key: YARN-2368 URL: https://issues.apache.org/jira/browse/YARN-2368 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: Leitao Guo Priority: Critical Attachments: YARN-2368.patch Both ResourceManagers threw STATE_STORE_OP_FAILED events and eventually failed. The ZooKeeper log shows that ZKRMStateStore tried to update a znode larger than 1MB, the default 'jute.maxbuffer' limit for both the ZooKeeper server and client. The ResourceManager (ip addr: 10.153.80.8) log shows the following: {code} 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} Meanwhile, the ZooKeeper log shows the following: {code} 2014-07-25 22:10:09,728 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x247684586e70006 at /10.153.80.8:58890 2014-07-25 22:10:09,730
[myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x247684586e70006 2014-07-25 22:10:09,730 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth packet /10.153.80.8:58890 2014-07-25 22:10:09,730 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success /10.153.80.8:58890 2014-07-25 22:10:09,742 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747 2014-07-25 22:10:09,743 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client
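Regarding the YARN-2368 logs above: the "Len error" is ZooKeeper rejecting a packet larger than its jute.maxbuffer limit (1 MB by default). Purely as a hedged illustration of that limit, and not the fix proposed in the attached YARN-2368.patch, the limit can be raised by setting the jute.maxbuffer system property on both the ZooKeeper server JVM and the client JVM (here, the ResourceManager). The 4 MB value below is an arbitrary example.
{code}
// Hedged sketch: raising jute.maxbuffer is only a workaround, not the patch attached to YARN-2368.
// ZooKeeper typically reads this property once at class-loading time, so in practice it is passed
// as a JVM flag, e.g. -Djute.maxbuffer=4194304, to both the ZooKeeper server and the ResourceManager.
public class JuteMaxBufferWorkaround {
  public static void main(String[] args) {
    // Must run before any ZooKeeper client classes are loaded in this JVM.
    System.setProperty("jute.maxbuffer", String.valueOf(4 * 1024 * 1024));
    System.out.println("jute.maxbuffer set to " + System.getProperty("jute.maxbuffer") + " bytes");
  }
}
{code}
Even with a larger buffer, the server side enforces its own limit, so the property has to match on both ends; the underlying growth of the stored attempt data is what this JIRA tracks.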
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Attachment: hadoop_job_suspend_resume.patch Hadoop Job Suspend and Resume svn patch file Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Attachments: Hadoop Job Suspend Resume Design.docx, hadoop_job_suspend_resume.patch Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid and convenient way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. Suspend/Resume Hadoop Jobs -- Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Labels: hadoop, jobs, resume, suspend Fix For: 2.2.0 Attachments: Hadoop Job Suspend Resume Design.docx, hadoop_job_suspend_resume.patch Original Estimate: 336h Remaining Estimate: 336h In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid and convenient way. -- This message was sent by Atlassian JIRA (v6.2#6252)
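The design document and svn patch attached to YARN-2172 are not reproduced in this digest. Purely as a hedged illustration of the suspend semantics described above (already-running containers keep running, no new containers are requested while suspended, and requests resume afterwards), a hypothetical AM-side gate might look like the following; all names here are invented for the sketch and are not taken from hadoop_job_suspend_resume.patch.
{code}
// Hypothetical sketch only; not code from hadoop_job_suspend_resume.patch.
// While suspended, the job requests no new containers but leaves running ones alone;
// on resume, it again asks for whatever is still missing.
import java.util.concurrent.atomic.AtomicBoolean;

public class SuspendAwareAllocationPolicy {
  private final AtomicBoolean suspended = new AtomicBoolean(false);

  public void suspend() { suspended.set(true); }
  public void resume()  { suspended.set(false); }

  /** Called on each AM heartbeat: how many new containers to request right now. */
  public int newContainersToRequest(int totalNeeded, int runningOrAlreadyRequested) {
    if (suspended.get()) {
      return 0; // suspended: keep existing containers, ask for nothing new
    }
    return Math.max(0, totalNeeded - runningOrAlreadyRequested);
  }
}
{code}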
[jira] [Commented] (YARN-2370) Fix comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078840#comment-14078840 ] Hadoop QA commented on YARN-2370: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658574/YARN-2370.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService org.apache.hadoop.yarn.server.resourcemanager.TestRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4477//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4477//console This message is automatically generated. Fix comment more accurate in AppSchedulingInfo.java --- Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch in method allocateOffSwitch of AppSchedulingInfo.java, only invoke update OffRack request, the comment should be Update cloned OffRack requests for recovery not Update cloned RackLocal and OffRack requests for recover {code} // Update cloned RackLocal and OffRack requests for recovery resourceRequests.add(cloneResourceRequest(offSwitchRequest)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2370) Made comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenwu Peng updated YARN-2370: - Summary: Made comment more accurate in AppSchedulingInfo.java (was: Fix comment more accurate in AppSchedulingInfo.java) Made comment more accurate in AppSchedulingInfo.java Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch In the allocateOffSwitch method of AppSchedulingInfo.java, only the OffRack request is updated, so the comment should read "Update cloned OffRack requests for recovery" rather than "Update cloned RackLocal and OffRack requests for recovery". {code} // Update cloned RackLocal and OffRack requests for recovery resourceRequests.add(cloneResourceRequest(offSwitchRequest)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
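In other words, YARN-2370 is a one-line comment correction. A minimal before/after sketch follows, with the surrounding method body omitted; the wording is taken from the description above, and the exact text in YARN-2370.0.patch may differ slightly.
{code}
// Before (misleading: only the off-switch request is cloned at this point):
// Update cloned RackLocal and OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));

// After (as proposed for AppSchedulingInfo#allocateOffSwitch):
// Update cloned OffRack requests for recovery
resourceRequests.add(cloneResourceRequest(offSwitchRequest));
{code}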
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078849#comment-14078849 ] Jian He commented on YARN-2354: --- looks good, checking this in. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue. {code} int numTotalContainersToRequest = numTotalContainers - previousAMRunningContainers.size(); for (int i = 0; i < numTotalContainersToRequest; ++i) { ContainerRequest containerAsk = setupContainerAskForRM(); amRMClient.addContainerRequest(containerAsk); } numRequestedContainers.set(numTotalContainersToRequest); {code} numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
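For context, a minimal sketch of the corrected bookkeeping follows, based only on the snippet and the last sentence of the YARN-2354 description above; field and method names are as they appear in that snippet, and the change actually committed may differ in detail.
{code}
// Sketch of the fix described above, not the committed diff itself.
// Ask only for the containers the previous attempt did not leave running, but record the
// full target in numRequestedContainers so a restarted AM does not over-allocate.
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
numRequestedContainers.set(numTotalContainers); // previously set to numTotalContainersToRequest
{code}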
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078851#comment-14078851 ] Li Lu commented on YARN-2354: - Thanks [~jianhe]. The second time around, I think the only problem left is the persistent one, but it is unrelated to this patch. DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue. {code} int numTotalContainersToRequest = numTotalContainers - previousAMRunningContainers.size(); for (int i = 0; i < numTotalContainersToRequest; ++i) { ContainerRequest containerAsk = setupContainerAskForRM(); amRMClient.addContainerRequest(containerAsk); } numRequestedContainers.set(numTotalContainersToRequest); {code} numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078859#comment-14078859 ] Hudson commented on YARN-2354: -- FAILURE: Integrated in Hadoop-trunk-Commit #5983 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5983/]) YARN-2354. DistributedShell may allocate more containers than client specified after AM restarts. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1614538) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSFailedAppMaster.java DistributedShell may allocate more containers than client specified after it restarts - Key: YARN-2354 URL: https://issues.apache.org/jira/browse/YARN-2354 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch, YARN-2354-072914.patch To reproduce, run distributed shell with the -num_containers option. In ApplicationMaster.java, the following code has an issue. {code} int numTotalContainersToRequest = numTotalContainers - previousAMRunningContainers.size(); for (int i = 0; i < numTotalContainersToRequest; ++i) { ContainerRequest containerAsk = setupContainerAskForRM(); amRMClient.addContainerRequest(containerAsk); } numRequestedContainers.set(numTotalContainersToRequest); {code} numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2370) Made comment more accurate in AppSchedulingInfo.java
[ https://issues.apache.org/jira/browse/YARN-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078870#comment-14078870 ] Hadoop QA commented on YARN-2370: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658574/YARN-2370.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4478//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4478//console This message is automatically generated. Made comment more accurate in AppSchedulingInfo.java Key: YARN-2370 URL: https://issues.apache.org/jira/browse/YARN-2370 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Priority: Trivial Attachments: YARN-2370.0.patch in method allocateOffSwitch of AppSchedulingInfo.java, only invoke update OffRack request, the comment should be Update cloned OffRack requests for recovery not Update cloned RackLocal and OffRack requests for recover {code} // Update cloned RackLocal and OffRack requests for recovery resourceRequests.add(cloneResourceRequest(offSwitchRequest)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)