[jira] [Commented] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class

2014-11-26 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226107#comment-14226107
 ] 

Tsuyoshi OZAWA commented on YARN-2404:
--

Thanks for your review, Jian!

 Remove ApplicationAttemptState and ApplicationState class in RMStateStore 
 class 
 

 Key: YARN-2404
 URL: https://issues.apache.org/jira/browse/YARN-2404
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
 Fix For: 2.7.0

 Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, 
 YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch, 
 YARN-2404.8.patch


 We can remove the ApplicationState and ApplicationAttemptState classes in 
 RMStateStore, given that we already have the ApplicationStateData and 
 ApplicationAttemptStateData records. We may just replace ApplicationState 
 with ApplicationStateData, and similarly for ApplicationAttemptState.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2025) Possible NPE in schedulers#addApplicationAttempt()

2014-11-24 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223993#comment-14223993
 ] 

Tsuyoshi OZAWA commented on YARN-2025:
--

Thanks for your point, [~rohithsharma]. I'll take a look.

 Possible NPE in schedulers#addApplicationAttempt()
 --

 Key: YARN-2025
 URL: https://issues.apache.org/jira/browse/YARN-2025
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2025.1.patch


 In FifoScheduler/FairScheduler/CapacityScheduler#addApplicationAttempt(), we 
 don't check whether {{application}} is null. This can cause an NPE in the 
 following sequence: addApplication() -> doneApplication() (e.g. 
 AppKilledTransition) -> addApplicationAttempt().
 {code}
 SchedulerApplication application =
 applications.get(applicationAttemptId.getApplicationId());
 String user = application.getUser();
 {code}
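
 A rough sketch of the kind of null guard being discussed (illustrative only; 
 the names mirror the snippet above, and LOG is assumed to be the scheduler's 
 logger):
 {code}
 SchedulerApplication application =
     applications.get(applicationAttemptId.getApplicationId());
 if (application == null) {
   // The application may already have been removed, e.g. by
   // AppKilledTransition, before the attempt-added event is processed.
   LOG.warn("Application " + applicationAttemptId.getApplicationId()
       + " not found when adding attempt " + applicationAttemptId);
   return;
 }
 String user = application.getUser();
 {code}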



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class

2014-11-24 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2404:
-
Attachment: YARN-2404.8.patch

[~jianhe], good catch. Updated the patch.

 Remove ApplicationAttemptState and ApplicationState class in RMStateStore 
 class 
 

 Key: YARN-2404
 URL: https://issues.apache.org/jira/browse/YARN-2404
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, 
 YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch, 
 YARN-2404.8.patch


 We can remove the ApplicationState and ApplicationAttemptState classes in 
 RMStateStore, given that we already have the ApplicationStateData and 
 ApplicationAttemptStateData records. We may just replace ApplicationState 
 with ApplicationStateData, and similarly for ApplicationAttemptState.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class

2014-11-22 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2404:
-
Attachment: YARN-2404.7.patch

 Remove ApplicationAttemptState and ApplicationState class in RMStateStore 
 class 
 

 Key: YARN-2404
 URL: https://issues.apache.org/jira/browse/YARN-2404
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, 
 YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch


 We can remove the ApplicationState and ApplicationAttemptState classes in 
 RMStateStore, given that we already have the ApplicationStateData and 
 ApplicationAttemptStateData records. We may just replace ApplicationState 
 with ApplicationStateData, and similarly for ApplicationAttemptState.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class

2014-11-22 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222333#comment-14222333
 ] 

Tsuyoshi OZAWA commented on YARN-2404:
--

Jian, thanks for your review! Updated the patch to address your comments:

* Removed the unused Credentials, ApplicationAttemptId, and an always-true assertion.
* Changed get/setAppAttemptTokens to accept/return Credentials instead of 
ByteBuffer (see the sketch below).
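
A rough sketch of the signature change described above (illustrative only; the 
actual record and modifiers follow the patch):

{code}
// Before: callers serialized/deserialized the tokens themselves.
public abstract ByteBuffer getAppAttemptTokens();
public abstract void setAppAttemptTokens(ByteBuffer attemptTokens);

// After: the state record accepts/returns Credentials directly, and the
// serialization stays inside the record implementation.
public abstract Credentials getAppAttemptTokens();
public abstract void setAppAttemptTokens(Credentials attemptTokens);
{code}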

 Remove ApplicationAttemptState and ApplicationState class in RMStateStore 
 class 
 

 Key: YARN-2404
 URL: https://issues.apache.org/jira/browse/YARN-2404
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, 
 YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch, YARN-2404.7.patch


 We can remove the ApplicationState and ApplicationAttemptState classes in 
 RMStateStore, given that we already have the ApplicationStateData and 
 ApplicationAttemptStateData records. We may just replace ApplicationState 
 with ApplicationStateData, and similarly for ApplicationAttemptState.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-11-22 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222338#comment-14222338
 ] 

Tsuyoshi OZAWA commented on YARN-2517:
--

[~zjshen], thanks for your comment.

{code}
// Async call - Type (1)
void asyncCall(Input, CallBackHandler);
{code}

If we choose Type (1), TimelineClient needs to manage every callback handler 
object it is passed, which can get complex. I think it's simpler to add 
{{registerAsyncCallbackHandler(CallBackHandler)}} to TimelineClient separately, 
or to pass the callback via the constructor. Or do we have use cases that need 
to switch the CallBackHandler per method call?
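
For illustration, the two API shapes being compared look roughly like this (all 
names besides putEntitiesAsync and CallBackHandler are hypothetical):

{code}
// Type (1): a handler is passed per call, so the client has to track every
// handler object it has been given.
void putEntitiesAsync(TimelineEntity entity, CallBackHandler handler);

// Alternative: register one handler up front (via a registration method or
// the constructor) and keep the async call itself simple.
void registerAsyncCallbackHandler(CallBackHandler handler);
void putEntitiesAsync(TimelineEntity entity);
{code}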

{quote}
Maybe compromise now is to add putEntitiesAsync to TimelineClient. In the 
future, let's see if we want to have a separate TimelineClientAsync that 
contains a bunch of async APIs.
{quote}

If we plan to migrate from putEntitiesAsync to TimelineClientAsync later, I think 
we should add TimelineClientAsync from the beginning. For now, it's enough to add 
*Async methods to the current TimelineClient, since users then don't need to 
create a separate client for async calls.

[~mitdesai] thanks for your help. Looking forward to your opinion. We're 
discussing the design based on [Vinod's 
suggestion|https://issues.apache.org/jira/browse/YARN-2517?focusedCommentId=14128819page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14128819].

 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch, YARN-2517.2.patch


 In some scenarios, we'd like to put timeline entities in another thread so as 
 not to block the current one.
 It would be good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class

2014-11-20 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2404:
-
Attachment: YARN-2404.5.patch

Refreshed the patch.

I found that TestRMRestart#testAppRecoveredInOrderOnRMRestart fails after the 
refactoring because recoverApplication loads data from 
RMStateStore#RMState#appState, which is created as an instance of HashMap. We 
should make it a TreeMap to preserve the recovery order by key, so I fixed that 
in this patch.
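
For reference, the change is roughly the following (assuming the map is keyed by 
ApplicationId, which is Comparable, so a TreeMap keyed by it is iterated in key 
order):

{code}
// HashMap gives no iteration-order guarantee, so recovery order was arbitrary.
// A TreeMap keyed by ApplicationId is iterated in key order instead.
Map<ApplicationId, ApplicationStateData> appState =
    new TreeMap<ApplicationId, ApplicationStateData>();
{code}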

[~jianhe], could you take a look?

 Remove ApplicationAttemptState and ApplicationState class in RMStateStore 
 class 
 

 Key: YARN-2404
 URL: https://issues.apache.org/jira/browse/YARN-2404
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, 
 YARN-2404.4.patch, YARN-2404.5.patch


 We can remove the ApplicationState and ApplicationAttemptState classes in 
 RMStateStore, given that we already have the ApplicationStateData and 
 ApplicationAttemptStateData records. We may just replace ApplicationState 
 with ApplicationStateData, and similarly for ApplicationAttemptState.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class

2014-11-20 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2404:
-
Attachment: YARN-2404.6.patch

Fixed the warnings reported by findbugs.

 Remove ApplicationAttemptState and ApplicationState class in RMStateStore 
 class 
 

 Key: YARN-2404
 URL: https://issues.apache.org/jira/browse/YARN-2404
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, 
 YARN-2404.4.patch, YARN-2404.5.patch, YARN-2404.6.patch


 We can remove the ApplicationState and ApplicationAttemptState classes in 
 RMStateStore, given that we already have the ApplicationStateData and 
 ApplicationAttemptStateData records. We may just replace ApplicationState 
 with ApplicationStateData, and similarly for ApplicationAttemptState.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2517) Implement TimelineClientAsync

2014-11-20 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2517:
-
Attachment: YARN-2517.2.patch

Sorry for the delay. Attached a Future-based implementation for simplicity; I 
think this design is one of the better ways to go.

[~vinodkv], [~zjshen], should we add read APIs in another JIRA? And do you have 
any opinions about the Future-based design?
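
A minimal sketch of the Future-based shape (illustrative only; {{executor}} is 
assumed to be an ExecutorService owned by the client, and putEntities is the 
existing synchronous call):

{code}
public Future<TimelinePutResponse> putEntitiesAsync(
    final TimelineEntity... entities) {
  // The caller gets a Future and decides when, or whether, to block on the
  // response; the actual put runs on the client's executor thread.
  return executor.submit(new Callable<TimelinePutResponse>() {
    @Override
    public TimelinePutResponse call() throws Exception {
      return putEntities(entities);
    }
  });
}
{code}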



 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch, YARN-2517.2.patch


 In some scenarios, we'd like to put timeline entities in another thread so as 
 not to block the current one.
 It would be good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-11-20 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219554#comment-14219554
 ] 

Tsuyoshi OZAWA commented on YARN-2517:
--

[~mitdesai] I think you're one of the users of TimelineClient. If you have any 
feedback about the interface, please let me know.

 Implement TimelineClientAsync
 -

 Key: YARN-2517
 URL: https://issues.apache.org/jira/browse/YARN-2517
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2517.1.patch, YARN-2517.2.patch


 In some scenarios, we'd like to put timeline entities in another thread so as 
 not to block the current one.
 It would be good to have a TimelineClientAsync like AMRMClientAsync and 
 NMClientAsync. It can buffer entities, put them in a separate thread, and 
 have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate

2014-11-18 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217448#comment-14217448
 ] 

Tsuyoshi OZAWA commented on YARN-2865:
--

[~rohithsharma], thanks for taking this issue. I'd like to +1 adding the 
Private and Unstable annotations to the methods defined in 
RMActiveServiceContext, as Karthik mentioned.

The other points look good to me.

 Application recovery continuously fails with Application with id already 
 present. Cannot duplicate
 

 Key: YARN-2865
 URL: https://issues.apache.org/jira/browse/YARN-2865
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Rohith
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2865.patch, YARN-2865.patch


 YARN-2588 handles the exception thrown while transitioningToActive and resets 
 activeServices. But it misses clearing the RMContext apps/nodes details, 
 ClusterMetrics, and QueueMetrics. This causes application recovery to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2800) Remove MemoryNodeLabelsStore and add a way to enable/disable node labels feature

2014-11-18 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217569#comment-14217569
 ] 

Tsuyoshi OZAWA commented on YARN-2800:
--

Thanks for your comment, Vinod, and thanks for the patch, Wangda. +1 for 
removing MemoryNodeLabelsStore.
My comments:
* The MemoryRMNodeLabelsManager used for tests does nothing in the new patch. 
How about renaming MemoryRMNodeLabelsManager to NullRMNodeLabelsManager for 
consistency with RMStateStore?
* Maybe not related to this JIRA, but it would be good to add a test of 
RMRestart with NodeLabelManager to avoid regressions.



 Remove MemoryNodeLabelsStore and add a way to enable/disable node labels 
 feature
 

 Key: YARN-2800
 URL: https://issues.apache.org/jira/browse/YARN-2800
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch, 
 YARN-2800-20141118-1.patch, YARN-2800-20141118-2.patch


 In the past, we had a MemoryNodeLabelStore, mostly so users could try this 
 feature without configuring where to store node labels on the file system. It 
 seems convenient, but it actually causes a bad user experience: a user may 
 add/remove labels and edit capacity-scheduler.xml, but after an RM restart the 
 labels are gone (we store them in memory), and the RM cannot start if some 
 queue uses labels that no longer exist in the cluster.
 As discussed, we should have an explicit way to let the user specify whether 
 he/she wants this feature. If node labels are disabled, any operation trying 
 to modify/use node labels will throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2607) TestDistributedShell fails in trunk

2014-11-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2607:
-
Attachment: test.log

[~leftnoteasy] yeah, it failed on trunk. Attaching a log.

 TestDistributedShell fails in trunk
 ---

 Key: YARN-2607
 URL: https://issues.apache.org/jira/browse/YARN-2607
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Wangda Tan
 Fix For: 2.6.0

 Attachments: YARN-2607-1.patch, YARN-2607-2.patch, YARN-2607-3.patch, 
 test.log


 From https://builds.apache.org/job/Hadoop-Yarn-trunk/691/console :
 {code}
 testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 35.641 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308)
 {code}
 On Linux, I got the following locally:
 {code}
 testDSAttemptFailuresValidityIntervalFailed(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 64.715 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertFalse(Assert.java:64)
   at org.junit.Assert.assertFalse(Assert.java:74)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalFailed(TestDistributedShell.java:384)
 testDSAttemptFailuresValidityIntervalSucess(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 115.842 sec   ERROR!
 java.lang.Exception: test timed out after 9 milliseconds
   at java.lang.Thread.sleep(Native Method)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:680)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:661)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalSucess(TestDistributedShell.java:342)
 testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 35.633 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode

2014-11-09 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204006#comment-14204006
 ] 

Tsuyoshi OZAWA commented on YARN-2830:
--

[~acmurthy] Let me review the latest patch.

 Add backwords compatible ContainerId.newInstance constructor for use within 
 Tez Local Mode
 --

 Key: YARN-2830
 URL: https://issues.apache.org/jira/browse/YARN-2830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
Priority: Blocker
 Attachments: YARN-2830-v1.patch, YARN-2830-v2.patch, 
 YARN-2830-v3.patch, YARN-2830-v4.patch


 YARN-2229 modified the private, unstable API for constructing ContainerIds. 
 Tez uses this API (it shouldn't, but it does) for Tez Local Mode. This causes 
 a NoSuchMethodError when using Tez compiled against pre-2.6. Instead, I 
 propose we add the backwards-compatible API, since overflow is not a problem 
 in Tez local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2607) TestDistributedShell fails in trunk

2014-11-09 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204073#comment-14204073
 ] 

Tsuyoshi OZAWA commented on YARN-2607:
--

TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression still 
fails locally for me because of a timeout.

 TestDistributedShell fails in trunk
 ---

 Key: YARN-2607
 URL: https://issues.apache.org/jira/browse/YARN-2607
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Wangda Tan
 Fix For: 2.6.0

 Attachments: YARN-2607-1.patch, YARN-2607-2.patch, YARN-2607-3.patch


 From https://builds.apache.org/job/Hadoop-Yarn-trunk/691/console :
 {code}
 testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 35.641 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308)
 {code}
 On Linux, I got the following locally:
 {code}
 testDSAttemptFailuresValidityIntervalFailed(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 64.715 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertFalse(Assert.java:64)
   at org.junit.Assert.assertFalse(Assert.java:74)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalFailed(TestDistributedShell.java:384)
 testDSAttemptFailuresValidityIntervalSucess(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 115.842 sec   ERROR!
 java.lang.Exception: test timed out after 9 milliseconds
   at java.lang.Thread.sleep(Native Method)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:680)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:661)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSAttemptFailuresValidityIntervalSucess(TestDistributedShell.java:342)
 testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)
   Time elapsed: 35.633 sec   FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at 
 org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:308)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode

2014-11-09 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204074#comment-14204074
 ] 

Tsuyoshi OZAWA commented on YARN-2830:
--

+1, the failure of TestApplicationClientProtocolOnHA is not related to the 
patch, and that test passes locally for me.

Vinod, thanks for pointing to the link.

 Add backwords compatible ContainerId.newInstance constructor for use within 
 Tez Local Mode
 --

 Key: YARN-2830
 URL: https://issues.apache.org/jira/browse/YARN-2830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
Priority: Blocker
 Attachments: YARN-2830-v1.patch, YARN-2830-v2.patch, 
 YARN-2830-v3.patch, YARN-2830-v4.patch


 YARN-2229 modified the private, unstable API for constructing ContainerIds. 
 Tez uses this API (it shouldn't, but it does) for Tez Local Mode. This causes 
 a NoSuchMethodError when using Tez compiled against pre-2.6. Instead, I 
 propose we add the backwards-compatible API, since overflow is not a problem 
 in Tez local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode

2014-11-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202580#comment-14202580
 ] 

Tsuyoshi OZAWA commented on YARN-2830:
--

Hi [~jeagles], thanks for the report and the contribution. I checked the source 
code of Tez Local Mode:

{code}
  ContainerId cId = ContainerId.newInstance(applicationAttemptId, 1);
{code}

I think it's more straightforward to fix this on the Tez side as follows, 
rather than adding an int-typed newInstance method:

{code}
  ContainerId cId = ContainerId.newInstance(applicationAttemptId, 1L);
{code}

What do you think? cc: [~bikassaha]

 Add backwords compatible ContainerId.newInstance constructor for use within 
 Tez Local Mode
 --

 Key: YARN-2830
 URL: https://issues.apache.org/jira/browse/YARN-2830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
Priority: Blocker
 Attachments: YARN-2830-v1.patch


 YARN-2229 modified the private, unstable API for constructing ContainerIds. 
 Tez uses this API (it shouldn't, but it does) for Tez Local Mode. This causes 
 a NoSuchMethodError when using Tez compiled against pre-2.6. Instead, I 
 propose we add the backwards-compatible API, since overflow is not a problem 
 in Tez local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode

2014-11-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202599#comment-14202599
 ] 

Tsuyoshi OZAWA commented on YARN-2830:
--

[~jeagles] Makes sense. Do you mind marking the new method as Deprecated? That 
helps users avoid using this method by mistake.
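
For illustration, the deprecated backwards-compatible overload could look 
roughly like this (a sketch only; it assumes the long-based factory introduced 
by YARN-2229 is named newContainerId):

{code}
// Keep the old int-based signature for compatibility, mark it deprecated,
// and delegate to the long-based factory.
@Deprecated
public static ContainerId newInstance(ApplicationAttemptId appAttemptId,
    int containerId) {
  return newContainerId(appAttemptId, containerId);
}
{code}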

 Add backwords compatible ContainerId.newInstance constructor for use within 
 Tez Local Mode
 --

 Key: YARN-2830
 URL: https://issues.apache.org/jira/browse/YARN-2830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
Priority: Blocker
 Attachments: YARN-2830-v1.patch


 YARN-2229 modified the private, unstable API for constructing ContainerIds. 
 Tez uses this API (it shouldn't, but it does) for Tez Local Mode. This causes 
 a NoSuchMethodError when using Tez compiled against pre-2.6. Instead, I 
 propose we add the backwards-compatible API, since overflow is not a problem 
 in Tez local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2830) Add backwords compatible ContainerId.newInstance constructor for use within Tez Local Mode

2014-11-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203076#comment-14203076
 ] 

Tsuyoshi OZAWA commented on YARN-2830:
--

+1 for Sid's approach. 

[~jeagles], I'll review the patch once the build failure is fixed.

 Add backwords compatible ContainerId.newInstance constructor for use within 
 Tez Local Mode
 --

 Key: YARN-2830
 URL: https://issues.apache.org/jira/browse/YARN-2830
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
Priority: Blocker
 Attachments: YARN-2830-v1.patch, YARN-2830-v2.patch, 
 YARN-2830-v3.patch


 YARN-2229 modified the private, unstable API for constructing ContainerIds. 
 Tez uses this API (it shouldn't, but it does) for Tez Local Mode. This causes 
 a NoSuchMethodError when using Tez compiled against pre-2.6. Instead, I 
 propose we add the backwards-compatible API, since overflow is not a problem 
 in Tez local mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2801) Documentation development for Node labels requirment

2014-11-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194705#comment-14194705
 ] 

Tsuyoshi OZAWA commented on YARN-2801:
--

[~gururaj] Great. Feel free to ask us if you have any problems or questions. 

 Documentation development for Node labels requirment
 

 Key: YARN-2801
 URL: https://issues.apache.org/jira/browse/YARN-2801
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: documentation
Reporter: Gururaj Shetty

 Documentation needs to be developed for the node label requirements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled

2014-11-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194719#comment-14194719
 ] 

Tsuyoshi OZAWA commented on YARN-2800:
--

[~leftnoteasy], thanks for taking this JIRA. IMHO, dumping the warning every 
time users run the commands is too verbose, since users may be trying 
MemoryRMNodeLabelsManager only experimentally. I'd prefer to log it once at 
startup and show which kind of RMNodeLabelsManager is in use on the Web UI, as 
we do for the RMStateStore. The patch on YARN-1326 can help you add the type to 
the Web UI.

 Should print WARN log in both RM/RMAdminCLI side when 
 MemoryRMNodeLabelsManager is enabled
 --

 Key: YARN-2800
 URL: https://issues.apache.org/jira/browse/YARN-2800
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch


 Even though we have documented this, it would be better to explicitly 
 print a message on both the RM and RMAdminCLI side saying that node 
 labels being added will be lost across an RM restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled

2014-11-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195130#comment-14195130
 ] 

Tsuyoshi OZAWA commented on YARN-2800:
--

[~leftnoteasy], thanks for your comments.

{quote}
+  + " this message is based on the yarn-site.xml settings "
+  + " in the machine you run \"yarn rmadmin ...\", if you "
+  + " already edited the field in yarn-site.xml of the node "
+  + " running RM, please ignore this message.";
{quote}

I think printing the message based on the client-side configuration can confuse 
the user, since it can differ from the RM-side configuration. Not every user has 
a copy of the RM-side configuration, and some users don't know its contents.

{quote}
But if user configured mem-based node labels manager, user may add labels to 
queue configurations, when RM will be failed to launch (specifically, CS cannot 
initialize) if a queue use a label but not existed in node labels manager
{quote}

Let me clarify this case: do you mean the RM will fail to allocate containers on 
labeled nodes after an RM restart, because the RM uses MemoryRMNodeLabelsManager 
and forgets the node-to-labels mapping? In that case, I think we should raise a 
warning to the submitter of the YARN app, e.g. that the application cannot be 
submitted for now since no node has the required label after the restart. That 
is more straightforward, because users can then notice the mistake in their 
label configuration.

So I think the better way is to log the warning once at startup and add the 
information to the Web UI for consistency. What do you think?

 Should print WARN log in both RM/RMAdminCLI side when 
 MemoryRMNodeLabelsManager is enabled
 --

 Key: YARN-2800
 URL: https://issues.apache.org/jira/browse/YARN-2800
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch


 Even though we have documented this, it would be better to explicitly 
 print a message on both the RM and RMAdminCLI side saying that node 
 labels being added will be lost across an RM restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled

2014-11-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195220#comment-14195220
 ] 

Tsuyoshi OZAWA commented on YARN-2800:
--

[~leftnoteasy], thanks for the clarification. Essentially, the label 
configuration is part of the RM's state. IMHO, ideally we should move this 
essential configuration into the RMStateStore to prevent the mismatch. I think 
ZK can handle it, since labels are not updated frequently and the number of 
labels is not large. cc: [~jianhe], [~kkambatl], what do you think?

 Should print WARN log in both RM/RMAdminCLI side when 
 MemoryRMNodeLabelsManager is enabled
 --

 Key: YARN-2800
 URL: https://issues.apache.org/jira/browse/YARN-2800
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch


 Even though we have documented this, it would be better to explicitly 
 print a message on both the RM and RMAdminCLI side saying that node 
 labels being added will be lost across an RM restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2794) Fix log msgs about distributing system-credentials

2014-11-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195258#comment-14195258
 ] 

Tsuyoshi OZAWA commented on YARN-2794:
--

[~jianhe], thanks for taking this JIRA. Shouldn't we use a ConcurrentHashMap? 
IIUC, making the variable volatile is not enough for synchronization in this 
case. Please correct me if I'm wrong.

 Fix log msgs about distributing system-credentials 
 ---

 Key: YARN-2794
 URL: https://issues.apache.org/jira/browse/YARN-2794
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2794.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2794) Fix log msgs about distributing system-credentials

2014-11-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195273#comment-14195273
 ] 

Tsuyoshi OZAWA commented on YARN-2794:
--

[~jianhe], oops, I got it. systemCredentials is only updated via 
setSystemCredentials, so your solution is enough. 
One minor nit: 
TestLogAggregationService#testAddNewTokenSentFromRMForLogAggregation calls 
{{this.context.getSystemCredentialsForApps().put(application1, credentials);}}. 
We should use a ConcurrentHashMap for that test case.
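
Something like the following in the test context setup (a sketch, under the 
assumption that getSystemCredentialsForApps() is backed by a plain map there):

{code}
// Back the test context's map with a ConcurrentHashMap so the put() above is
// safe alongside any NM threads reading the system credentials.
Map<ApplicationId, Credentials> systemCredentials =
    new ConcurrentHashMap<ApplicationId, Credentials>();
{code}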

 Fix log msgs about distributing system-credentials 
 ---

 Key: YARN-2794
 URL: https://issues.apache.org/jira/browse/YARN-2794
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Attachments: YARN-2794.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled

2014-11-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195407#comment-14195407
 ] 

Tsuyoshi OZAWA commented on YARN-2800:
--

[~leftnoteasy], if we treat the labels as configuration that can be updated 
frequently, ZK is not a good option, as you mentioned. In that case, I think 
NodeLabelsManager, whose backend could be leveldb or RocksDB, should be loosely 
coupled with the RM, like the TimelineServer, to keep the RM stable. One option 
is turning NodeLabelsManager into a NodeLabelsServer. That means the RM should 
keep working correctly even if the NodeLabelsManager is temporarily unavailable, 
and update operations should only affect the NodeLabelsManager (not the RM). For 
example, the RM could pull the label information from the NodeLabelsServer 
periodically, treat it as a hint, and schedule based on it; even without the 
information, the RM should still schedule apps. I think this weak-consistency 
approach is suitable for large-scale updates.
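
A very rough sketch of that pull-based idea (every name here is hypothetical; 
this only illustrates the weak-consistency loop, not an actual API):

{code}
// The RM refreshes its view of node labels on a timer and keeps scheduling
// with the last known snapshot if the NodeLabelsServer is unavailable.
ScheduledExecutorService refresher =
    Executors.newSingleThreadScheduledExecutor();
refresher.scheduleAtFixedRate(new Runnable() {
  @Override
  public void run() {
    try {
      latestNodeToLabels = nodeLabelsServerClient.fetchNodeToLabels();
    } catch (IOException e) {
      LOG.warn("NodeLabelsServer unavailable; keeping the last snapshot", e);
    }
  }
}, 0, 30, TimeUnit.SECONDS);
{code}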

 Should print WARN log in both RM/RMAdminCLI side when 
 MemoryRMNodeLabelsManager is enabled
 --

 Key: YARN-2800
 URL: https://issues.apache.org/jira/browse/YARN-2800
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client, resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: YARN-2800-20141102-1.patch, YARN-2800-20141102-2.patch


 Even though we have documented this, it would be better to explicitly 
 print a message on both the RM and RMAdminCLI side saying that node 
 labels being added will be lost across an RM restart.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-30 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189720#comment-14189720
 ] 

Tsuyoshi OZAWA commented on YARN-2712:
--

[~adhoot] [~kkambatl] [~jianhe] do you have additional comments?

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2712.1.patch, YARN-2712.2.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery partially lacks test cases 
 for FairScheduler. We should add them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2712) TestWorkPreservingRMRestart: Augment FS tests with queue and headroom checks

2014-10-30 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189835#comment-14189835
 ] 

Tsuyoshi OZAWA commented on YARN-2712:
--

Thanks, Anubhav and Karthik, for the reviews.

 TestWorkPreservingRMRestart: Augment FS tests with queue and headroom checks
 

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Fix For: 2.7.0

 Attachments: YARN-2712.1.patch, YARN-2712.2.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery partially lacks test cases 
 for FairScheduler. We should add them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2765) Add leveldb-based implementation for RMStateStore

2014-10-30 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189934#comment-14189934
 ] 

Tsuyoshi OZAWA commented on YARN-2765:
--

Currently we can assume that LeveldbRMStateStore is accessed from a single 
process, so failure detection itself is done by the EmbeddedElector, which 
depends on ZooKeeper; in addition, LeveldbRMStateStore has no support for 
fencing. That means we still need to launch ZooKeeper, and it's a normal 
decision to use ZKRMStateStore in that case. Please correct me if I'm wrong.

On another front, if we use RocksDB as the backend DB of the timeline server, 
we don't need leveldb and it would be a good decision to switch the dependency.

 Add leveldb-based implementation for RMStateStore
 -

 Key: YARN-2765
 URL: https://issues.apache.org/jira/browse/YARN-2765
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-2765.patch, YARN-2765v2.patch


 It would be nice to have a leveldb option to the resourcemanager recovery 
 store. Leveldb would provide some benefits over the existing filesystem store 
 such as better support for atomic operations, fewer I/O ops per state update, 
 and far fewer total files on the filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1813:
-
Attachment: YARN-1813.5.patch

Refreshed the patch.

 Better error message for yarn logs when permission denied
 ---

 Key: YARN-1813
 URL: https://issues.apache.org/jira/browse/YARN-1813
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.3.0
Reporter: Andrew Wang
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, 
 YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch


 I ran some MR jobs as the hdfs user, and then forgot to sudo -u when 
 grabbing the logs. yarn logs prints an error message like the following:
 {noformat}
 [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010
 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at 
 a2402.halxg.cloudera.com/10.20.212.10:8032
 Logs not available at 
 /tmp/logs/andrew.wang/logs/application_1394482121761_0010
 Log aggregation has not completed or is not enabled.
 {noformat}
 It'd be nicer if it said Permission denied or AccessControlException or 
 something like that instead, since that's the real issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1813:
-
Affects Version/s: 2.4.1
   2.5.1

 Better error message for yarn logs when permission denied
 ---

 Key: YARN-1813
 URL: https://issues.apache.org/jira/browse/YARN-1813
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.3.0, 2.4.1, 2.5.1
Reporter: Andrew Wang
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, 
 YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch


 I ran some MR jobs as the hdfs user, and then forgot to sudo -u when 
 grabbing the logs. yarn logs prints an error message like the following:
 {noformat}
 [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010
 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at 
 a2402.halxg.cloudera.com/10.20.212.10:8032
 Logs not available at 
 /tmp/logs/andrew.wang/logs/application_1394482121761_0010
 Log aggregation has not completed or is not enabled.
 {noformat}
 It'd be nicer if it said Permission denied or AccessControlException or 
 something like that instead, since that's the real issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2737) Misleading msg in LogCLI when app is not successfully submitted

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186548#comment-14186548
 ] 

Tsuyoshi OZAWA commented on YARN-2737:
--

YARN-1813 addresses the issue of handling AccessControlException correctly.

 Misleading msg in LogCLI when app is not successfully submitted 
 

 Key: YARN-2737
 URL: https://issues.apache.org/jira/browse/YARN-2737
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Jian He
Assignee: Rohith

 {{LogCLiHelpers#logDirNotExist}} prints the message {{Log aggregation has not 
 completed or is not enabled.}} if the app log file doesn't exist. This is 
 misleading if the application was not submitted successfully; clearly, 
 we won't have logs for such an application. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1813) Better error message for yarn logs when permission denied

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1813:
-
Attachment: YARN-1813.6.patch

Thanks for your review, Rohith! Updated:

1. Changed the log message to Permission denied. : /path/to/dir.
2. Removed a needless change in AggregatedLogsBlock.
3. Updated the log message in {{logDirNotExist}}.


 Better error message for yarn logs when permission denied
 ---

 Key: YARN-1813
 URL: https://issues.apache.org/jira/browse/YARN-1813
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.3.0, 2.4.1, 2.5.1
Reporter: Andrew Wang
Assignee: Tsuyoshi OZAWA
Priority: Minor
 Attachments: YARN-1813.1.patch, YARN-1813.2.patch, YARN-1813.2.patch, 
 YARN-1813.3.patch, YARN-1813.4.patch, YARN-1813.5.patch, YARN-1813.6.patch


 I ran some MR jobs as the hdfs user, and then forgot to sudo -u when 
 grabbing the logs. yarn logs prints an error message like the following:
 {noformat}
 [andrew.wang@a2402 ~]$ yarn logs -applicationId application_1394482121761_0010
 14/03/10 16:05:10 INFO client.RMProxy: Connecting to ResourceManager at 
 a2402.halxg.cloudera.com/10.20.212.10:8032
 Logs not available at 
 /tmp/logs/andrew.wang/logs/application_1394482121761_0010
 Log aggregation has not completed or is not enabled.
 {noformat}
 It'd be nicer if it said Permission denied or AccessControlException or 
 something like that instead, since that's the real issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2737) Misleading msg in LogCLI when app is not successfully submitted

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187253#comment-14187253
 ] 

Tsuyoshi OZAWA commented on YARN-2737:
--

[~jianhe], [~rohithsharma] reviewed the patch on YARN-1813, and it covers the 
comment about this issue. Could you take a look?

 Misleading msg in LogCLI when app is not successfully submitted 
 

 Key: YARN-2737
 URL: https://issues.apache.org/jira/browse/YARN-2737
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Jian He
Assignee: Rohith

 {{LogCLiHelpers#logDirNotExist}} prints the message {{Log aggregation has not 
 completed or is not enabled.}} if the app log file doesn't exist. This is 
 misleading if the application was not submitted successfully; clearly, 
 we won't have logs for such an application. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2742) FairSchedulerConfiguration fails to parse if there is extra space between value and unit

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187264#comment-14187264
 ] 

Tsuyoshi OZAWA commented on YARN-2742:
--

[~ywskycn], thanks for the contribution. How about adding a test case that 
includes a trailing space, like " 1024 mb, 4 core "?
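
Roughly the kind of case meant here (a sketch assuming 
FairSchedulerConfiguration.parseResourceConfigValue, shown in the stack trace 
below, is the parsing entry point):

{code}
@Test
public void testParseResourceWithSurroundingSpaces() throws Exception {
  // A trailing space and a double space before the unit should still parse.
  Resource parsed =
      FairSchedulerConfiguration.parseResourceConfigValue("1024  mb, 4 vcores ");
  assertEquals(1024, parsed.getMemory());
  assertEquals(4, parsed.getVirtualCores());
}
{code}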

 FairSchedulerConfiguration fails to parse if there is extra space between 
 value and unit
 

 Key: YARN-2742
 URL: https://issues.apache.org/jira/browse/YARN-2742
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Wei Yan
Priority: Minor
 Attachments: YARN-2742-1.patch


 FairSchedulerConfiguration is very strict about the number of space 
 characters between the value and the unit: 0 or 1 space.
 For example, for a value like the following:
 {noformat}
 <maxResources>4096  mb, 2 vcores</maxResources>
 {noformat}
 (note the 2 spaces)
 The line above fails to parse:
 {noformat}
 2014-10-24 22:56:40,802 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService:
  Failed to reload fair scheduler config file - will use existing allocations.
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException:
  Missing resource: mb
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.findResource(FairSchedulerConfiguration.java:247)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.parseResourceConfigValue(FairSchedulerConfiguration.java:231)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:347)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:381)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:293)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService$1.run(AllocationFileLoaderService.java:117)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2742) FairSchedulerConfiguration fails to parse if there is extra space between value and unit

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187428#comment-14187428
 ] 

Tsuyoshi OZAWA commented on YARN-2742:
--

LGTM.

 FairSchedulerConfiguration fails to parse if there is extra space between 
 value and unit
 

 Key: YARN-2742
 URL: https://issues.apache.org/jira/browse/YARN-2742
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.4.0
Reporter: Sangjin Lee
Assignee: Wei Yan
Priority: Minor
 Attachments: YARN-2742-1.patch, YARN-2742-2.patch


 FairSchedulerConfiguration is very strict about the number of space 
 characters between the value and the unit: 0 or 1 space.
 For example, for a value like the following:
 {noformat}
 <maxResources>4096  mb, 2 vcores</maxResources>
 {noformat}
 (note the 2 spaces)
 The line above fails to parse:
 {noformat}
 2014-10-24 22:56:40,802 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService:
  Failed to reload fair scheduler config file - will use existing allocations.
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException:
  Missing resource: mb
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.findResource(FairSchedulerConfiguration.java:247)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairSchedulerConfiguration.parseResourceConfigValue(FairSchedulerConfiguration.java:231)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:347)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:381)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:293)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService$1.run(AllocationFileLoaderService.java:117)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2765) Add leveldb-based implementation for RMStateStore

2014-10-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14187955#comment-14187955
 ] 

Tsuyoshi OZAWA commented on YARN-2765:
--

[~jlowe], great work! It looks good to me overall, including the error handling 
and resource management.

Minor nits:

How about adding helper methods like getKeyPrefix/getNodePath for building the 
key prefix and node path? ZKRMStateStore also does so.
{code}
String keyPrefix = RM_APP_ROOT + "/" + appId + "/";
...
String appKey = RM_APP_ROOT + "/" + appId
{code}

I also found that the patch includes a lot of hard-coded "/" strings. I think 
it's better to have a private field SEPARATOR = "/".
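
A sketch of those helpers (names taken from the comment above, purely 
illustrative; RM_APP_ROOT is the existing store root constant):

{code}
private static final String SEPARATOR = "/";

private static String getApplicationNodeKey(ApplicationId appId) {
  // e.g. <RM_APP_ROOT>/<application id>
  return RM_APP_ROOT + SEPARATOR + appId;
}

private static String getApplicationKeyPrefix(ApplicationId appId) {
  // Prefix used when scanning all keys that belong to one application.
  return getApplicationNodeKey(appId) + SEPARATOR;
}
{code}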

 Add leveldb-based implementation for RMStateStore
 -

 Key: YARN-2765
 URL: https://issues.apache.org/jira/browse/YARN-2765
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-2765.patch


 It would be nice to have a leveldb option to the resourcemanager recovery 
 store. Leveldb would provide some benefits over the existing filesystem store 
 such as better support for atomic operations, fewer I/O ops per state update, 
 and far fewer total files on the filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore

2014-10-27 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186402#comment-14186402
 ] 

Tsuyoshi OZAWA commented on YARN-2725:
--

[~jianhe], do you mind taking a look?

 Adding test cases of retrying requests about ZKRMStateStore
 ---

 Key: YARN-2725
 URL: https://issues.apache.org/jira/browse/YARN-2725
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2725.1.patch


 YARN-2721 found a race condition in the ZK-specific retry semantics. We should 
 add tests for the case of retrying requests to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-26 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184603#comment-14184603
 ] 

Tsuyoshi OZAWA commented on YARN-2712:
--

[~kkambatl], could you take a look?

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2712.1.patch, YARN-2712.2.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery partially lacks test cases 
 for FairScheduler. We should add them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-26 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184679#comment-14184679
 ] 

Tsuyoshi OZAWA commented on YARN-2712:
--

[~adhoot] [~kkambatl], oops, I misread the previous review comment from 
Karthik. Thanks for your review, Anubhav.

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2712.1.patch, YARN-2712.2.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases 
 about FairScheduler partially. We should support them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-24 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA reassigned YARN-2712:


Assignee: Tsuyoshi OZAWA

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA

 TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases 
 about FairScheduler partially. We should support them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-24 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2712:
-
Attachment: YARN-2712.1.patch

Attaching a first patch including the following changes:

1. Adding tests about FSQueue ({{checkFSQueue}}).
2. Moving the headroom tests into check*Queue.
3. Renaming asserteMetrics to assertMetrics.
4. Calling {{updateRootQueueMetrics}} explicitly in {{FairScheduler#update}} 
because I found an unexpected behavior of rootQueue while writing the code - 
{{updateRootQueueMetrics}} isn't called until an RMNode is registered, updated, 
removed, or added.


 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2712.1.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases 
 about FairScheduler partially. We should support them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore

2014-10-24 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2725:
-
Attachment: YARN-2725.1.patch

Attaching a first patch.

 Adding test cases of retrying requests about ZKRMStateStore
 ---

 Key: YARN-2725
 URL: https://issues.apache.org/jira/browse/YARN-2725
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA
 Attachments: YARN-2725.1.patch


 YARN-2721 found a race condition for ZK-specific retry semantics. We should 
 add tests about the case of retry requests to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-24 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182760#comment-14182760
 ] 

Tsuyoshi OZAWA commented on YARN-2712:
--

The test failure does not look related to the patch. [~kkambatl], [~jianhe], do you 
mind taking a look, please?

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2712.1.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases 
 about FairScheduler partially. We should support them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-24 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2712:
-
Attachment: YARN-2712.2.patch

Thanks for your review, Karthik. All the lines you pointed out can be removed. Updated.

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2712.1.patch, YARN-2712.2.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases 
 about FairScheduler partially. We should support them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-24 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183845#comment-14183845
 ] 

Tsuyoshi OZAWA commented on YARN-2712:
--

The javadoc warning and the test failures of TestMetricsSystemImpl and 
TestWebDelegationToken look intermittent.

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2712.1.patch, YARN-2712.2.patch


 TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases 
 about FairScheduler partially. We should support them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU

2014-10-23 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-810:

Issue Type: Improvement  (was: Bug)

 Support CGroup ceiling enforcement on CPU
 -

 Key: YARN-810
 URL: https://issues.apache.org/jira/browse/YARN-810
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Chris Riccomini
Assignee: Sandy Ryza
 Attachments: YARN-810.patch, YARN-810.patch


 Problem statement:
 YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
 Containers are then allowed to request vcores between the minimum and maximum 
 defined in the yarn-site.xml.
 In the case where a single-threaded container requests 1 vcore, with a 
 pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
 the core it's using, provided that no other container is also using it. This 
 happens, even though the only guarantee that YARN/CGroups is making is that 
 the container will get at least 1/4th of the core.
 If a second container then comes along, the second container can take 
 resources from the first, provided that the first container is still getting 
 at least its fair share (1/4th).
 There are certain cases where this is desirable. There are also certain cases 
 where it might be desirable to have a hard limit on CPU usage, and not allow 
 the process to go above the specified resource requirement, even if it's 
 available.
 Here's an RFC that describes the problem in more detail:
 http://lwn.net/Articles/336127/
 Solution:
 As it happens, when CFS is used in combination with CGroups, you can enforce 
 a ceiling using two files in cgroups:
 {noformat}
 cpu.cfs_quota_us
 cpu.cfs_period_us
 {noformat}
 The usage of these two files is documented in more detail here:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
 Testing:
 I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, 
 it behaves as described above (it is a soft cap, and allows containers to use 
 more than they asked for). I then tested CFS CPU quotas manually with YARN.
 First, you can see that CFS is in use in the CGroup, based on the file names:
 {noformat}
 [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
 total 0
 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
 drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
 -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
 -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
 -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
 10
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
 -1
 {noformat}
 Oddly, it appears that the cfs_period_us is set to .1s, not 1s.
 We can place processes in hard limits. I have process 4370 running YARN 
 container container_1371141151815_0003_01_03 on a host. By default, it's 
 running at ~300% cpu usage.
 {noformat}
 CPU
 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
 {noformat}
 When I set the CFS quota:
 {noformat}
 echo 1000 > 
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
  CPU
 4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 ...
 {noformat}
 It drops to 1% usage, and you can see the box has room to spare:
 {noformat}
 Cpu(s):  2.4%us,  1.0%sy,  0.0%ni, 92.2%id,  4.2%wa,  0.0%hi,  0.1%si, 
 0.0%st
 {noformat}
 Turning the quota back to -1:
 {noformat}
 echo -1 > 
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
 {noformat}
 Burns the cores again:
 {noformat}
 Cpu(s): 11.1%us,  1.7%sy,  0.0%ni, 83.9%id,  3.1%wa,  0.0%hi,  0.2%si, 
 0.0%st
 CPU
 4370 criccomi  20   0 1157m 563m  14m S 253.9  0.8  89:32.31 ...
 {noformat}
 On my dev box, I was testing CGroups by running a python process eight times, 
 to burn through all the cores, since it was doing as described above (giving 
 extra CPU to the process, even with a cpu.shares limit). Toggling the 
 cfs_quota_us seems to enforce a hard limit.
 Implementation:
 What do you guys think about introducing a variable to YarnConfiguration:
 bq. 
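
As a rough illustration of the ceiling described above, the quota could be derived 
from a container's vcore request along these lines (the {{writeCGroupFile}} helper, 
{{containerCGroupPath}} and {{container}} variables are hypothetical, not actual 
NodeManager code):
{code}
// Sketch under assumptions: helper and paths are hypothetical.
// CFS enforces a ceiling of (cfs_quota_us / cfs_period_us) cores per period.
long periodUs = 100000L;  // cpu.cfs_period_us, the 0.1s default observed above
int vcores = container.getResource().getVirtualCores();
long quotaUs = periodUs * vcores;  // cap the container at 'vcores' cores of CPU time
writeCGroupFile(containerCGroupPath, "cpu.cfs_period_us", Long.toString(periodUs));
writeCGroupFile(containerCGroupPath, "cpu.cfs_quota_us", Long.toString(quotaUs));
{code}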

[jira] [Updated] (YARN-2398) TestResourceTrackerOnHA crashes

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2398:
-
Attachment: (was: TestResourceTrackerOnHA-output.txt)

 TestResourceTrackerOnHA crashes
 ---

 Key: YARN-2398
 URL: https://issues.apache.org/jira/browse/YARN-2398
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jason Lowe

 TestResourceTrackerOnHA is currently crashing and failing trunk builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2398) TestResourceTrackerOnHA crashes

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179797#comment-14179797
 ] 

Tsuyoshi OZAWA commented on YARN-2398:
--

Rohith, Wangda, yeah, thanks for pointing that out. The log I attached does not look 
related to the issue Jason mentioned. Removing it.

 TestResourceTrackerOnHA crashes
 ---

 Key: YARN-2398
 URL: https://issues.apache.org/jira/browse/YARN-2398
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jason Lowe

 TestResourceTrackerOnHA is currently crashing and failing trunk builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2725) Adding retry requests about ZKRMStateStore

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created YARN-2725:


 Summary: Adding retry requests about ZKRMStateStore
 Key: YARN-2725
 URL: https://issues.apache.org/jira/browse/YARN-2725
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA


YARN-2721 found a race condition for ZK-specific retry semantics. We should add 
tests about the case of retry requests to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2721) Race condition: ZKRMStateStore retry logic may throw NodeExist exception

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179842#comment-14179842
 ] 

Tsuyoshi OZAWA commented on YARN-2721:
--

Good job, Jian. Created YARN-2725 for adding tests to cover these cases.

 Race condition: ZKRMStateStore retry logic may throw NodeExist exception 
 -

 Key: YARN-2721
 URL: https://issues.apache.org/jira/browse/YARN-2721
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Fix For: 2.6.0

 Attachments: YARN-2721.1.patch


 Blindly retrying operations in zookeeper will not work for non-idempotent 
 operations (like create znode). The reason is that the client can do a create 
 znode, but the response may not be returned because the server can die or 
 timeout. In case of retrying the create znode, it will throw a NODE_EXISTS 
 exception from the earlier create from the same session.  
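
 For context, the kind of handling this implies for a retried, non-idempotent create 
 looks roughly like the following (a simplified sketch against the raw ZooKeeper API, 
 not the actual ZKRMStateStore retry code; zkClient, path, data and LOG are assumed 
 to exist in the surrounding class):
 {code}
 // Sketch only: treat NODE_EXISTS on a retried create as success, because the
 // earlier attempt from the same session may have created the node already.
 try {
   zkClient.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
 } catch (KeeperException.NodeExistsException e) {
   // The previous attempt likely succeeded even though its response was lost.
   LOG.info("Node " + path + " already exists, assuming the earlier create succeeded");
 }
 {code}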



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2725) Adding test cases of retrying requests about ZKRMStateStore

2014-10-22 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2725:
-
Summary: Adding test cases of retrying requests about ZKRMStateStore  (was: 
Adding retry requests about ZKRMStateStore)

 Adding test cases of retrying requests about ZKRMStateStore
 ---

 Key: YARN-2725
 URL: https://issues.apache.org/jira/browse/YARN-2725
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA

 YARN-2721 found a race condition for ZK-specific retry semantics. We should 
 add tests about the case of retry requests to ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2710) RM HA tests failed intermittently on trunk

2014-10-21 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2710:
-
Attachment: TestResourceTrackerOnHA-output.2.txt

I could reproduce the same issue with TestResourceTrackerOnHA - it's an intermittent 
failure and happens rarely. Attaching a log from my local run.

 RM HA tests failed intermittently on trunk
 --

 Key: YARN-2710
 URL: https://issues.apache.org/jira/browse/YARN-2710
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Wangda Tan
 Attachments: TestResourceTrackerOnHA-output.2.txt, 
 org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt


 Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
 TestResourceTrackerOnHA, etc.
 {code}
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
 testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
   Time elapsed: 9.491 sec   ERROR!
 java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
 to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at 
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
   at 
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
   at 
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-20 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created YARN-2712:


 Summary: Adding tests about FSQueue and headroom of FairScheduler 
to TestWorkPreservingRMRestart
 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Test
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA


TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases about 
FairScheduler partially. We should support them.

{code}
   // Until YARN-1959 is resolved
   if (scheduler.getClass() != FairScheduler.class) {
 assertEquals(availableResources, schedulerAttempt.getHeadroom());
   }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart

2014-10-20 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2712:
-
Issue Type: Sub-task  (was: Test)
Parent: YARN-556

 Adding tests about FSQueue and headroom of FairScheduler to 
 TestWorkPreservingRMRestart
 ---

 Key: YARN-2712
 URL: https://issues.apache.org/jira/browse/YARN-2712
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA

 TestWorkPreservingRMRestart#testSchedulerRecovery doesn't have test cases 
 about FairScheduler partially. We should support them.
 {code}
// Until YARN-1959 is resolved
if (scheduler.getClass() != FairScheduler.class) {
  assertEquals(availableResources, schedulerAttempt.getHeadroom());
}
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2014-10-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176288#comment-14176288
 ] 

Tsuyoshi OZAWA commented on YARN-2710:
--

[~leftnoteasy], it passes on my local environment too. I checked the log you attached - 
it failed because an EOFException occurred. An EOFException can happen when different 
protobuf formats are mixed. Could you retry the test after {{mvn clean}}? It 
sometimes resolves the problem.

 RM HA tests failed intermittently on trunk
 --

 Key: YARN-2710
 URL: https://issues.apache.org/jira/browse/YARN-2710
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Wangda Tan
 Attachments: 
 org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt


 Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
 TestResourceTrackerOnHA, etc.
 {code}
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
 testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
   Time elapsed: 9.491 sec   ERROR!
 java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
 to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at 
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
   at 
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
   at 
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176320#comment-14176320
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Thanks Jian, Karthik, Vinod, Xuan and Anubhav for reviews and comments! 

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Fix For: 2.6.0

 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, 
 YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2014-10-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176421#comment-14176421
 ] 

Tsuyoshi OZAWA commented on YARN-2710:
--

BTW, YARN-2398 is addressing the intermittent failure of TestResourceTrackerOnHA.

 RM HA tests failed intermittently on trunk
 --

 Key: YARN-2710
 URL: https://issues.apache.org/jira/browse/YARN-2710
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Wangda Tan
 Attachments: 
 org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt


 Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
 TestResourceTrackerOnHA, etc.
 {code}
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
 testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
   Time elapsed: 9.491 sec   ERROR!
 java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
 to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at 
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
   at 
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
   at 
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-2710) RM HA tests failed intermittently on trunk

2014-10-19 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA resolved YARN-2710.
--
Resolution: Duplicate

 RM HA tests failed intermittently on trunk
 --

 Key: YARN-2710
 URL: https://issues.apache.org/jira/browse/YARN-2710
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Wangda Tan
 Attachments: 
 org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt


 Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
 TestResourceTrackerOnHA, etc.
 {code}
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
 testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
   Time elapsed: 9.491 sec   ERROR!
 java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
 to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at 
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
   at 
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
   at 
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2014-10-19 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176424#comment-14176424
 ] 

Tsuyoshi OZAWA commented on YARN-2710:
--

Closing this issue as dup of YARN-2398.

 RM HA tests failed intermittently on trunk
 --

 Key: YARN-2710
 URL: https://issues.apache.org/jira/browse/YARN-2710
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Wangda Tan
 Attachments: 
 org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt


 Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
 TestResourceTrackerOnHA, etc.
 {code}
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
 testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
   Time elapsed: 9.491 sec   ERROR!
 java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
 to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   at 
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
   at 
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
   at 
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
   at 
 org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174916#comment-14174916
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

The failures of TestFairScheduler and TestResourceTrackerOnHA are not related to 
the patch:
* TestFairScheduler fails intermittently - it passes on my local environment.
* TestResourceTrackerOnHA fails on trunk.

[~jianhe], could you take a look?

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-17 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.29.patch

Thanks for your review, Jian.

I see - it's better because it's closer to the real code path. I also noticed that 
the v28 patch includes needless changes introduced by older versions of the patch. 
Attaching a new patch (v29) that removes them and simplifies the test code.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, 
 YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-17 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175678#comment-14175678
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Thanks for your comment, Wangda. I confirmed that the failure is not related to the 
patch - all the failures show the same result:
{quote}
java.lang.IllegalArgumentException: Illegal capacity of -1.0 for label=x in 
queue=root.default
{quote}

I also confirmed that the tests pass on my local environment.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, 
 YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.26.patch

Rebased on trunk.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2398) TestResourceTrackerOnHA crashes

2014-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2398:
-
Attachment: TestResourceTrackerOnHA-output.txt

Reproduced the issue on my local environment. Attaching the log.

 TestResourceTrackerOnHA crashes
 ---

 Key: YARN-2398
 URL: https://issues.apache.org/jira/browse/YARN-2398
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jason Lowe
 Attachments: TestResourceTrackerOnHA-output.txt


 TestResourceTrackerOnHA is currently crashing and failing trunk builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-16 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173728#comment-14173728
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

The test failure is not related to the patch and has been filed as YARN-2398 - it 
still fails without the patch. [~jianhe], [~kkambatl], could you review the latest 
patch?

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.27.patch

Updated to reuse the ugi object before and after restart.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.3.patch, YARN-1879.4.patch, 
 YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, 
 YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.28.patch

Fixed the test failures of TestContainerResourceUsage and TestClientToAMTokens.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-15 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312-branch-2.8.patch

[~jianhe] [~jlowe] Thanks for your comments. Attaching a patch for branch-2.

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-branch-2.8.patch, YARN-2312-wip.patch, 
 YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, 
 YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch, 
 YARN-2312.7.patch


 {{ContainerId#getId}} will only return partial value of containerId, only 
 sequence number of container id without epoch, after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-15 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.25.patch

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-15 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173346#comment-14173346
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Talked with Jian offline. 

{quote}
In this case, the token is expired for the application after the AM's container 
finishes, and I think we don't need to handle it.
{quote}

I'd like to confirm whether finishApplicationMaster() can be issued after the AM 
container exits. There is no such case, but finishApplicationMaster() can be issued 
after the RM removes the AM's entry in the following case:

1. RM1 saves the app in the RMStateStore and then crashes.
2. FinishApplicationMasterResponse#isRegistered still returns false.
3. The AM keeps retrying against the 2nd RM.

Thanks very much for clarifying, Jian. Attached an updated patch which includes 
a test for retried finishApplicationMaster and a test for retried 
registerApplicationMaster before and after RM restart.
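
For reference, the AM-side retry path being exercised is roughly the following loop 
(a simplified sketch assuming the standard ApplicationMasterProtocol API, not the 
test code in the patch; {{amProtocol}} is an assumed protocol proxy):
{code}
// Simplified sketch: keep calling finishApplicationMaster until the RM reports
// the unregistration as committed, so a retry after failover is harmless.
FinishApplicationMasterRequest request = FinishApplicationMasterRequest.newInstance(
    FinalApplicationStatus.SUCCEEDED, "", "");
while (true) {
  FinishApplicationMasterResponse response = amProtocol.finishApplicationMaster(request);
  if (response.getIsUnregistered()) {
    break;  // the RM (possibly a different instance after failover) has recorded the finish
  }
  Thread.sleep(100);  // back off and retry; the retry proxy handles failover underneath
}
{code}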

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-12 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168671#comment-14168671
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

The test failure of TestAMRestart does not look related to the patch since it 
passes locally.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.3.patch, YARN-1879.4.patch, 
 YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, 
 YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-11 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.24.patch

Updated:
* Renaming TestApplicationMasterServiceProtocolOnHA#initiate to initialize().
* About registerApplicationMaster(), added a test case for duplicated requests.

{quote}
We should probably move the following check in finishApplicationMaster call 
upfront so that ApplicationDoesNotExistInCacheException is not thrown for 
already completed apps.
{quote}

In this case, the token is expired for the application after the AM's container 
finishes, and I think we don't need to handle it. [~jianhe], what do you think?


 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.3.patch, YARN-1879.4.patch, 
 YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, 
 YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312.7.patch

Thanks Jian, you're right. Updating to use Long.parseLong instead of 
Integer.parseInt in YarnChild.java.
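
To illustrate why (a standalone example, not the actual YarnChild diff): with the 
epoch encoded, the numeric part of a container ID can exceed the int range, so only 
a long parse is safe.
{code}
// Example value larger than Integer.MAX_VALUE once a non-zero epoch is encoded.
String idPart = "4503599627370501";
long containerId = Long.parseLong(idPart);  // fine
// Integer.parseInt(idPart);                // would throw NumberFormatException
{code}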

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, 
 YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch, YARN-2312.7.patch


 {{ContainerId#getId}} will only return partial value of containerId, only 
 sequence number of container id without epoch, after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163248#comment-14163248
 ] 

Tsuyoshi OZAWA commented on YARN-2312:
--

s/YARN-6115/MAPREDUCE-6115/

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, 
 YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch


 {{ContainerId#getId}} will only return partial value of containerId, only 
 sequence number of container id without epoch, after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163468#comment-14163468
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Thanks Karthik and Jian for your review!

{quote}
We should probably move the following check in finishApplicationMaster call 
upfront so that ApplicationDoesNotExistInCacheException is not thrown for 
already completed apps.
{quote}

Good catch. OK, I'll move the check upfront.
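
As a rough, self-contained sketch of what "moving the check upfront" means (the 
class, map, and states below are invented for illustration and are not the 
ApplicationMasterService code):

{code}
// Sketch only: a finish request for an application that already completed is
// acknowledged instead of failing a lookup in the per-attempt response cache.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FinishUpfrontCheckSketch {
  enum AppState { RUNNING, FINISHED, KILLED, FAILED }

  final Map<String, AppState> apps = new ConcurrentHashMap<>();

  boolean finishApplicationMaster(String appId) {
    AppState state = apps.get(appId);
    if (state != null && state != AppState.RUNNING) {
      // Already in a final state: report success rather than throwing an
      // "application does not exist in cache" style exception.
      return true;
    }
    // ... the normal unregister path would run here ...
    apps.put(appId, AppState.FINISHED);
    return true;
  }

  public static void main(String[] args) {
    FinishUpfrontCheckSketch sketch = new FinishUpfrontCheckSketch();
    sketch.apps.put("app_1", AppState.FINISHED);
    System.out.println(sketch.finishApplicationMaster("app_1"));  // true, no exception
  }
}
{code}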

{quote}
2. We should also probably add a test that makes duplicate requests to the 
same/different RM and verify the behavior is as expected. 
{quote}

Maybe TestWorkPreservingRMRestart is a good place to add the tests. I'll try it.


 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2666) TestFairScheduler.testContinuousScheduling fails Intermittently

2014-10-08 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created YARN-2666:


 Summary: TestFairScheduler.testContinuousScheduling fails 
Intermittently
 Key: YARN-2666
 URL: https://issues.apache.org/jira/browse/YARN-2666
 Project: Hadoop YARN
  Issue Type: Test
  Components: scheduler
Reporter: Tsuyoshi OZAWA
Assignee: Wei Yan


The test fails on trunk.
{code}
Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
  Time elapsed: 0.582 sec  <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2252) Intermittent failure of TestFairScheduler.testContinuousScheduling

2014-10-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164570#comment-14164570
 ] 

Tsuyoshi OZAWA commented on YARN-2252:
--

[~kkambatl] [~ywskycn] Thank you, I opened YARN-2666 to address the problem.

 Intermittent failure of TestFairScheduler.testContinuousScheduling
 --

 Key: YARN-2252
 URL: https://issues.apache.org/jira/browse/YARN-2252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: trunk-win
Reporter: Ratandeep Ratti
  Labels: hadoop2, scheduler, yarn
 Fix For: 2.6.0

 Attachments: YARN-2252-1.patch, yarn-2252-2.patch


 This test-case fails sporadically on my machine. I think I have a plausible 
 explanation for this.
 It seems that when the scheduler is asked for resources, the resource 
 requests being constructed have no preference for the hosts (nodes).
 The two mock hosts constructed both have 8192 MB of memory.
 The containers (resources) being requested each require 1024 MB of memory, 
 hence a single node can run both resource requests for the application.
 At the end of the test-case it is asserted that the containers (resource 
 requests) run on different nodes, but since we haven't specified any node 
 preference when requesting the resources, the scheduler (at times) runs both 
 containers (requests) on the same node.
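
As a hedged illustration of the missing node preference described above (this is 
not how the test builds its requests; the test uses scheduler-level 
ResourceRequests, while the sketch below uses the AMRMClient API):

{code}
// Illustrative only: pinning each container request to a named host (with
// relaxLocality=false) would prevent the scheduler from packing both 1024 MB
// containers onto the same 8192 MB node.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class NodePinnedRequestSketch {
  public static AMRMClient.ContainerRequest requestOn(String host) {
    Resource capability = Resource.newInstance(1024, 1);  // 1024 MB, 1 vcore
    Priority priority = Priority.newInstance(0);
    // relaxLocality=false: only the named host may satisfy this request.
    return new AMRMClient.ContainerRequest(
        capability, new String[] { host }, null, priority, false);
  }
}
{code}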



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2014-10-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161724#comment-14161724
 ] 

Tsuyoshi OZAWA commented on YARN-1458:
--

Sure, thanks for the reply :-)

 FairScheduler: Zero weight can lead to livelock
 ---

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
 YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, 
 YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, 
 yarn-1458-7.patch, yarn-1458-8.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor thread gets blocked 
 when clients submit lots of jobs; it is not easy to reproduce. We ran the test 
 cluster for days to reproduce it. The output of the jstack command on the 
 resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-07 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312.6.patch

Thanks Jason and Jian for the review. Updated:

* Removed an unnecessary change in TestTaskAttemptListenerImpl.java - sorry, 
this change was included by mistake.
* Defined 0xffL as CONTAINER_ID_BITMASK and exposed it.
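
As a rough illustration of why such a bitmask is needed at all: the 64-bit value 
returned by getContainerId() carries both the RM restart epoch and the 
per-application sequence number, so recovering the old int-style id means masking 
off the epoch bits. The bit widths in the sketch below are assumptions for 
illustration, not values taken from the patch.

{code}
// Illustration only; the real constant and bit layout live in ContainerId.
// Assume the low 40 bits hold the per-application sequence number and the
// remaining high bits hold the RM restart epoch.
public final class ContainerIdBits {
  static final long SEQ_BITMASK = (1L << 40) - 1;

  static long pack(long epoch, long sequence) {
    return (epoch << 40) | (sequence & SEQ_BITMASK);
  }

  static long sequenceOf(long containerId) {
    return containerId & SEQ_BITMASK;   // roughly what the deprecated getId() exposed
  }

  static long epochOf(long containerId) {
    return containerId >>> 40;
  }

  public static void main(String[] args) {
    long id = pack(3, 17);
    System.out.println(epochOf(id) + " " + sequenceOf(id));  // 3 17
  }
}
{code}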

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, 
 YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch


 {{ContainerId#getId}} will only return a partial value of the containerId 
 (only the sequence number, without the epoch) after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-07 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.23.patch

The failure of TestClientToAMTokens looks unrelated - it passed on my local 
machine. Let me submit the same patch again.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.23.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2252) Intermittent failure of TestFairScheduler.testContinuousScheduling

2014-10-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163118#comment-14163118
 ] 

Tsuyoshi OZAWA commented on YARN-2252:
--

Hi, this test failure is found on trunk - you can find it 
[here|https://issues.apache.org/jira/browse/YARN-2312?focusedCommentId=14161902page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14161902].
 The log is as follows:

{code}
Tests run: 79, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.698 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
testContinuousScheduling(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler)
  Time elapsed: 0.582 sec  <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<1>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler.testContinuousScheduling(TestFairScheduler.java:3372)
{code}

Can I reopen this problem?

 Intermittent failure of TestFairScheduler.testContinuousScheduling
 --

 Key: YARN-2252
 URL: https://issues.apache.org/jira/browse/YARN-2252
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: trunk-win
Reporter: Ratandeep Ratti
  Labels: hadoop2, scheduler, yarn
 Fix For: 2.6.0

 Attachments: YARN-2252-1.patch, yarn-2252-2.patch


 This test-case fails sporadically on my machine. I think I have a plausible 
 explanation for this.
 It seems that when the scheduler is asked for resources, the resource 
 requests being constructed have no preference for the hosts (nodes).
 The two mock hosts constructed both have 8192 MB of memory.
 The containers (resources) being requested each require 1024 MB of memory, 
 hence a single node can run both resource requests for the application.
 At the end of the test-case it is asserted that the containers (resource 
 requests) run on different nodes, but since we haven't specified any node 
 preference when requesting the resources, the scheduler (at times) runs both 
 containers (requests) on the same node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-07 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163120#comment-14163120
 ] 

Tsuyoshi OZAWA commented on YARN-2312:
--

The test failure is not related - the failure of TestFairScheduler is reported 
on YARN-2252 and the failure of TestPipeApplication is reported on YARN-6115. 

[~jianhe], [~jlowe], could you review latest patch?

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, 
 YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch


 {{ContainerId#getId}} will only return a partial value of the containerId 
 (only the sequence number, without the epoch) after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-06 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.22.patch

Made registerApplicationMaster idempotent and marked 
registerApplicationMaster/finishApplicationMaster with the Idempotent annotation.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, 
 YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, 
 YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-06 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161241#comment-14161241
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

For now, I have no idea how to reconstruct the same response after failover. 
Currently the latest patch only returns an empty response. This is one discussion 
point of this design.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, 
 YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, 
 YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-06 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161499#comment-14161499
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Thanks for your comments, Jian and Karthik.

{quote}
from RM’s perspective, these are just new requests, as the new RM doesn’t have 
any cache for previous requests from client.
{quote}

I confirmed that this is true. Neither {{finishApplicationMaster}} nor 
{{registerApplicationMaster}} touches the data in ZK directly, so the RM can 
handle retried requests transparently in the following cases: 

1. When EmbeddedElector chooses a different RM as the leader before and after the 
failover, ZK doesn't have the RMAppAttempt/RMApp data, so the RM recognizes a 
retried request as a new request. E.g. there are an active RM (RM1) and a 
standby RM (RM2), and leadership fails over from RM1 to RM2.
2. When EmbeddedElector chooses the same RM as the leader before and after the 
failover, the RM goes into the standby state, stops all services before the 
failover, and reloads the RMAppAttempt/RMApp data. In this case too, the RM 
recognizes a retried request as a new request. E.g. there are an active RM (RM1) 
and a standby RM (RM2), and leadership fails over from RM1 back to RM1. 

I think there is no problem with marking these methods as Idempotent.
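
For context, a minimal sketch of what the annotated methods could look like; the 
signatures mirror ApplicationMasterProtocol, but the interface below is only an 
illustration of the annotation, not the committed change:

{code}
// Sketch only: marking the two methods @Idempotent tells the RM failover retry
// policy that replaying them after a failover is safe.
import java.io.IOException;
import org.apache.hadoop.io.retry.Idempotent;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterResponse;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

public interface AnnotatedAmProtocolSketch {
  @Idempotent
  RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) throws YarnException, IOException;

  @Idempotent
  FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request) throws YarnException, IOException;
}
{code}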

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, 
 YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, 
 YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over

2014-10-06 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.23.patch

Marked registerApplicationMaster and finishApplicationMaster with the Idempotent 
annotation.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM 
 fail over
 

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159806#comment-14159806
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

[~jianhe], [~xgong], thanks for your suggestion. I agree with your suggestion 
about {{registerApplicationMaster}}. How about {{finishApplicationMaster}}? After 
the change, we cannot distinguish a failed RPC from a successful one. What do you 
think?

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-05 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159832#comment-14159832
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Sounds good. Then we can simply remove the retry cache. I'm updating the patch to 
remove the retry cache. [~kkambatl], please let me know if you have any opinions 
about the design change. 

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-04 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312.5.patch

Fixed the findbugs warning.

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, 
 YARN-2312.4.patch, YARN-2312.5.patch


 {{ContainerId#getId}} will only return a partial value of the containerId 
 (only the sequence number, without the epoch) after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-04 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159425#comment-14159425
 ] 

Tsuyoshi OZAWA commented on YARN-2312:
--

The test failure of TestPipeApplication is not related to this JIRA and is filed 
as MAPREDUCE-6120.

[~jlowe], could you take a look, please?

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, 
 YARN-2312.4.patch, YARN-2312.5.patch


 {{ContainerId#getId}} will only return a partial value of the containerId 
 (only the sequence number, without the epoch) after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157751#comment-14157751
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

Thanks for your comment, Karthik. I'm almost done addressing your comments.

{quote}
Are there cases when we don't want RetryCache enabled? IMO, we should always 
use the RetryCache (no harm). If we decide on having a config, the default 
should be true.
{quote}

Basically, we can enable the RetryCache without harm, and that is what I 
implemented. One concern is the tests - on my local machine, lots of tests fail 
for the following reason:

{code}
org.apache.hadoop.metrics2.MetricsException: Metrics source 
RetryCache.AppMasterServiceRetryCache already exists!
{code}

I'll try to fix them in the next patch by adding 
{{DefaultMetricsSystem.setMiniClusterMode(true);}} to the tests.
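
A minimal sketch of that test-side fix, assuming a JUnit 4 test class (the class 
name below is hypothetical):

{code}
// Sketch of the test-side workaround: mini-cluster mode lets repeated
// registrations of the same metrics source name (here the
// "RetryCache.AppMasterServiceRetryCache" source) coexist in one test JVM.
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.junit.BeforeClass;

public class RetryCacheTestSetupSketch {
  @BeforeClass
  public static void setupMetricsSystem() {
    DefaultMetricsSystem.setMiniClusterMode(true);
  }
}
{code}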

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.2-wip.patch, 
 YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-03 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.19.patch

Updated:
* Changed to enable the RetryCache by default. I think having a configuration 
option is better than always using the RetryCache because of memory consumption.
* Set DEFAULT_RM_RETRY_CACHE_EXPIRY_MS to 10 * 60 * 1000 instead of 60.
* Changed TestApplicationMasterServiceRetryCache to have only lines shorter 
than 80 chars.
* Fixed some tests to call DefaultMetricsSystem.setMiniClusterMode(true).
* Renamed TestApplicationMasterService#*WithRetryCache to WithoutRetryCache and 
changed the tests to configure RM_RETRY_CACHE_ENABLED to false.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, 
 YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, 
 YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-03 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.20.patch

Fixed TestAMRMClientOnRMRestart. The failure of TestAMRestart looks unrelated.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, 
 YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-03 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1879:
-
Attachment: YARN-1879.21.patch

Thank you for the review, Karthik. Updated the patch:
* Rebased the patch on trunk.
* Updated yarn-default.xml.
* Fixed the value of RM_RETRY_CACHE_EXPIRY_MS from v20 as follows:
{code}
-  public static final String RM_RETRY_CACHE_EXPIRY_MS =
-      RM_PREFIX + ".retry-cache" + ".expiry-ms";
+  public static final String RM_RETRY_CACHE_EXPIRY_MS =
+      RM_PREFIX + "retry-cache.expiry-ms";
{code}

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-10-03 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158948#comment-14158948
 ] 

Tsuyoshi OZAWA commented on YARN-1879:
--

The failure of TestRMWebServicesDelegationTokens looks unrelated and 
intermittent - the test passed on my local machine.

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, 
 YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, 
 YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, 
 YARN-1879.21.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, 
 YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated

2014-10-03 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2312:
-
Attachment: YARN-2312.4.patch

Refreshed a patch.

 Marking ContainerId#getId as deprecated
 ---

 Key: YARN-2312
 URL: https://issues.apache.org/jira/browse/YARN-2312
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, 
 YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch


 {{ContainerId#getId}} will only return a partial value of the containerId 
 (only the sequence number, without the epoch) after YARN-2229. We should 
 mark {{ContainerId#getId}} as deprecated and use 
 {{ContainerId#getContainerId}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch >0 after YARN-2182

2014-10-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2562:
-
Attachment: (was: YARN-2562.5.patch)

 ContainerId@toString() is unreadable for epoch >0 after YARN-2182
 -

 Key: YARN-2562
 URL: https://issues.apache.org/jira/browse/YARN-2562
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, 
 YARN-2562.4.patch, YARN-2562.5.patch


 ContainerID string format is unreadable for RMs that restarted at least once 
 (epoch > 0) after YARN-2182. For e.g, 
 container_1410901177871_0001_01_05_17.
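
 One possible direction for a more readable format, sketched with assumed padding 
 widths; the exact format adopted by the patch may differ:

{code}
// Illustration of a more readable toString: keep the familiar
// container_<cluster-ts>_<app>_<attempt>_<sequence> shape and only prepend an
// epoch marker when the epoch is non-zero. Padding widths are assumptions.
public final class ContainerIdToStringSketch {
  static String format(long clusterTimestamp, int appId, int attemptId,
      long containerSeq, int epoch) {
    StringBuilder sb = new StringBuilder("container_");
    if (epoch > 0) {
      sb.append("e").append(epoch).append("_");
    }
    sb.append(clusterTimestamp).append("_")
        .append(String.format("%04d", appId)).append("_")
        .append(String.format("%02d", attemptId)).append("_")
        .append(String.format("%06d", containerSeq));
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(format(1410901177871L, 1, 1, 5, 17));
    // container_e17_1410901177871_0001_01_000005
  }
}
{code}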



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch >0 after YARN-2182

2014-10-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2562:
-
Attachment: YARN-2562.5.patch

 ContainerId@toString() is unreadable for epoch >0 after YARN-2182
 -

 Key: YARN-2562
 URL: https://issues.apache.org/jira/browse/YARN-2562
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, 
 YARN-2562.4.patch, YARN-2562.5.patch


 ContainerID string format is unreadable for RMs that restarted at least once 
 (epoch > 0) after YARN-2182. For e.g, 
 container_1410901177871_0001_01_05_17.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch >0 after YARN-2182

2014-10-02 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2562:
-
Attachment: YARN-2562.5-2.patch

 ContainerId@toString() is unreadable for epoch >0 after YARN-2182
 -

 Key: YARN-2562
 URL: https://issues.apache.org/jira/browse/YARN-2562
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, 
 YARN-2562.4.patch, YARN-2562.5-2.patch, YARN-2562.5.patch


 ContainerID string format is unreadable for RMs that restarted at least once 
 (epoch > 0) after YARN-2182. For e.g, 
 container_1410901177871_0001_01_05_17.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

