[jira] [Commented] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService
[ https://issues.apache.org/jira/browse/YARN-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960988#comment-13960988 ] Hadoop QA commented on YARN-1904: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12638841/YARN-1904.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3515//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3515//console This message is automatically generated. > Uniform the NotFound messages from ClientRMService and > ApplicationHistoryClientService > -- > > Key: YARN-1904 > URL: https://issues.apache.org/jira/browse/YARN-1904 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1904.1.patch > > > It's good to make ClientRMService and ApplicationHistoryClientService throw > NotFoundException with similar messages -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService
[ https://issues.apache.org/jira/browse/YARN-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1904: -- Target Version/s: 2.4.1 > Uniform the NotFound messages from ClientRMService and > ApplicationHistoryClientService > -- > > Key: YARN-1904 > URL: https://issues.apache.org/jira/browse/YARN-1904 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1904.1.patch > > > It's good to make ClientRMService and ApplicationHistoryClientService throw > NotFoundException with similar messages -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService
[ https://issues.apache.org/jira/browse/YARN-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1904: -- Attachment: YARN-1904.1.patch Created a patch with simple message edits; no new test cases are included. > Uniform the NotFound messages from ClientRMService and > ApplicationHistoryClientService > -- > > Key: YARN-1904 > URL: https://issues.apache.org/jira/browse/YARN-1904 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1904.1.patch > > > It's good to make ClientRMService and ApplicationHistoryClientService throw > NotFoundException with similar messages -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1904) Uniform the XXXXNotFound messages from ClientRMService and ApplicationHistoryClientService
Zhijie Shen created YARN-1904: - Summary: Uniform the NotFound messages from ClientRMService and ApplicationHistoryClientService Key: YARN-1904 URL: https://issues.apache.org/jira/browse/YARN-1904 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen It's good to make ClientRMService and ApplicationHistoryClientService throw NotFoundException with similar messages -- This message was sent by Atlassian JIRA (v6.2#6252)
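Editor's note: the issue above only states the goal of unifying the messages. A minimal sketch of one way to do it follows; the helper class, its placement, and the message wording are assumptions for illustration, not the contents of YARN-1904.1.patch.
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException;

// Hypothetical shared helper so that ClientRMService and
// ApplicationHistoryClientService throw NotFound errors with the same wording.
public final class NotFoundMessages {
  private NotFoundMessages() {}

  public static ApplicationNotFoundException appNotFound(ApplicationId appId) {
    // Illustrative message text; the actual patch may phrase it differently.
    return new ApplicationNotFoundException(
        "Application with id '" + appId + "' doesn't exist in the system.");
  }
}
{code}
Both services would then call NotFoundMessages.appNotFound(appId) instead of building their own strings.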
[jira] [Updated] (YARN-1701) Improve default paths of timeline store and generic history store
[ https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1701: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-321) > Improve default paths of timeline store and generic history store > - > > Key: YARN-1701 > URL: https://issues.apache.org/jira/browse/YARN-1701 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-1701.v01.patch > > > When I enable AHS via yarn.ahs.enabled, the app history is still not visible > in AHS webUI. This is due to NullApplicationHistoryStore as > yarn.resourcemanager.history-writer.class. It would be good to have just one > key to enable basic functionality. > yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is > local file system location. However, FileSystemApplicationHistoryStore uses > DFS by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1701) Improve default paths of timeline store and generic history store
[ https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1701: -- Summary: Improve default paths of timeline store and generic history store (was: More intuitive defaults for AHS) > Improve default paths of timeline store and generic history store > - > > Key: YARN-1701 > URL: https://issues.apache.org/jira/browse/YARN-1701 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-1701.v01.patch > > > When I enable AHS via yarn.ahs.enabled, the app history is still not visible > in AHS webUI. This is due to NullApplicationHistoryStore as > yarn.resourcemanager.history-writer.class. It would be good to have just one > key to enable basic functionality. > yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is > local file system location. However, FileSystemApplicationHistoryStore uses > DFS by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1701) More intuitive defaults for AHS
[ https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960967#comment-13960967 ] Zhijie Shen commented on YARN-1701: --- [~jira.shegalov], would you mind updating the patch? It no longer applies. And can we have a one-shot fix for both the timeline store and the generic history store paths? Thanks! > More intuitive defaults for AHS > --- > > Key: YARN-1701 > URL: https://issues.apache.org/jira/browse/YARN-1701 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-1701.v01.patch > > > When I enable AHS via yarn.ahs.enabled, the app history is still not visible > in AHS webUI. This is due to NullApplicationHistoryStore as > yarn.resourcemanager.history-writer.class. It would be good to have just one > key to enable basic functionality. > yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is > local file system location. However, FileSystemApplicationHistoryStore uses > DFS by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1701) More intuitive defaults for AHS
[ https://issues.apache.org/jira/browse/YARN-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1701: -- Affects Version/s: (was: 2.4.0) 2.4.1 > More intuitive defaults for AHS > --- > > Key: YARN-1701 > URL: https://issues.apache.org/jira/browse/YARN-1701 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-1701.v01.patch > > > When I enable AHS via yarn.ahs.enabled, the app history is still not visible > in AHS webUI. This is due to NullApplicationHistoryStore as > yarn.resourcemanager.history-writer.class. It would be good to have just one > key to enable basic functionality. > yarn.ahs.fs-history-store.uri uses {code}${hadoop.log.dir}{code}, which is > local file system location. However, FileSystemApplicationHistoryStore uses > DFS by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
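Editor's note: to make the mismatch described above concrete, here is a hedged sketch using only the keys named in the report; the values, the helper class, and the fully-qualified store class name are assumptions.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustration of the report: enabling AHS alone is not enough, because two
// other defaults get in the way. The keys are quoted from the issue text.
public class AhsConfigSketch {
  public static Configuration configureAhs() {
    Configuration conf = new Configuration();
    conf.setBoolean("yarn.ahs.enabled", true);
    // The default writer is NullApplicationHistoryStore, so nothing is recorded:
    conf.set("yarn.resourcemanager.history-writer.class",
        "org.apache.hadoop.yarn.server.applicationhistoryservice"
            + ".FileSystemApplicationHistoryStore"); // package is an assumption
    // The default URI points at ${hadoop.log.dir}, a local path, while
    // FileSystemApplicationHistoryStore expects DFS by default:
    conf.set("yarn.ahs.fs-history-store.uri", "/yarn/system/history"); // hypothetical DFS path
    return conf;
  }
}
{code}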
[jira] [Commented] (YARN-1898) Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM
[ https://issues.apache.org/jira/browse/YARN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960941#comment-13960941 ] Hudson commented on YARN-1898: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5460 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5460/]) YARN-1898. Addendum patch to ensure /jmx and /metrics are re-directed to Active RM. (acmurthy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1584954) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebAppFilter.java > Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are > redirecting to Active RM > - > > Key: YARN-1898 > URL: https://issues.apache.org/jira/browse/YARN-1898 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Yesha Vora >Assignee: Xuan Gong > Fix For: 2.4.1 > > Attachments: YARN-1898.1.patch, YARN-1898.2.patch, YARN-1898.3.patch, > YARN-1898.addendum.patch, YARN-1898.addendum.patch > > > Standby RM links /conf, /stacks, /logLevel, /metrics, /jmx is redirected to > Active RM. > It should not be redirected to Active RM -- This message was sent by Atlassian JIRA (v6.2#6252)
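Editor's note: judging from the file list above, the fix lives in RMWebAppFilter. A minimal sketch of the idea, not the actual filter code (the class and method names here are assumptions):
{code}
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: the standby RM serves diagnostic endpoints locally and only
// redirects web-UI traffic to the active RM.
public class StandbyRedirectPolicy {
  private static final Set<String> NON_REDIRECTED_URIS =
      Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(
          "/conf", "/stacks", "/logLevel", "/metrics", "/jmx", "/logs")));

  public static boolean shouldRedirect(String uri) {
    return !NON_REDIRECTED_URIS.contains(uri);
  }
}
{code}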
[jira] [Commented] (YARN-1878) Yarn standby RM taking long to transition to active
[ https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960929#comment-13960929 ] Arun C Murthy commented on YARN-1878: - [~xgong] is this ready to go? Let's get this into 2.4.1. Tx. > Yarn standby RM taking long to transition to active > --- > > Key: YARN-1878 > URL: https://issues.apache.org/jira/browse/YARN-1878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong > Attachments: YARN-1878.1.patch > > > In our HA tests we are noticing that sometimes it can take up to 10s for the > standby RM to transition to active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1878) Yarn standby RM taking long to transition to active
[ https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1878: Target Version/s: 2.4.1 > Yarn standby RM taking long to transition to active > --- > > Key: YARN-1878 > URL: https://issues.apache.org/jira/browse/YARN-1878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong > Attachments: YARN-1878.1.patch > > > In our HA tests we are noticing that sometimes it can take up to 10s for the > standby RM to transition to active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1878) Yarn standby RM taking long to transition to active
[ https://issues.apache.org/jira/browse/YARN-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1878: Priority: Blocker (was: Major) > Yarn standby RM taking long to transition to active > --- > > Key: YARN-1878 > URL: https://issues.apache.org/jira/browse/YARN-1878 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong >Priority: Blocker > Attachments: YARN-1878.1.patch > > > In our HA tests we are noticing that sometimes it can take up to 10s for the > standby RM to transition to active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1898) Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM
[ https://issues.apache.org/jira/browse/YARN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960826#comment-13960826 ] Hadoop QA commented on YARN-1898: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12638792/YARN-1898.addendum.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3514//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3514//console This message is automatically generated. > Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are > redirecting to Active RM > - > > Key: YARN-1898 > URL: https://issues.apache.org/jira/browse/YARN-1898 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Yesha Vora >Assignee: Xuan Gong > Fix For: 2.4.1 > > Attachments: YARN-1898.1.patch, YARN-1898.2.patch, YARN-1898.3.patch, > YARN-1898.addendum.patch, YARN-1898.addendum.patch > > > Standby RM links /conf, /stacks, /logLevel, /metrics, /jmx is redirected to > Active RM. > It should not be redirected to Active RM -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960809#comment-13960809 ] Xuan Gong commented on YARN-1903: - +1 LGTM Also, I ran TestNMClient with this patch applied on Windows several times. All runs passed. > Killing Container on NEW and LOCALIZING will result in exitCode and > diagnostics not set > --- > > Key: YARN-1903 > URL: https://issues.apache.org/jira/browse/YARN-1903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1903.1.patch > > > The container status after stopping container is not expected. > {code} > java.lang.AssertionError: 4: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1872) TestDistributedShell occasionally fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960797#comment-13960797 ] Hong Zhiguo commented on YARN-1872: --- Yes. It is. And the MapReduce V2 AM contains some code to work around this strange behavior. I'll review the YARN-1902 patch later. But anyway, it's better to move the check inside the loop (which is what's done in this patch). > TestDistributedShell occasionally fails in trunk > > > Key: YARN-1872 > URL: https://issues.apache.org/jira/browse/YARN-1872 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Hong Zhiguo >Priority: Blocker > Attachments: TestDistributedShell.out, YARN-1872.patch > > > From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console : > TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and > TestDistributedShell#testDSShell timed out. -- This message was sent by Atlassian JIRA (v6.2#6252)
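Editor's note: "moving the check inside the loop" amounts to an AM-side allocation loop along these lines. This is a sketch, not the DistributedShell patch; launch() is a hypothetical helper, and releasing the excess containers is only one possible policy.
{code}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class AllocationLoopSketch {
  static void waitForContainers(
      AMRMClient<AMRMClient.ContainerRequest> amRMClient,
      int numTotalContainers) throws Exception {
    int allocated = 0;
    while (allocated < numTotalContainers) {
      AllocateResponse response = amRMClient.allocate(0.1f);
      for (Container c : response.getAllocatedContainers()) {
        // Check per container, inside the loop: the RM may hand back more
        // containers than were requested.
        if (allocated >= numTotalContainers) {
          amRMClient.releaseAssignedContainer(c.getId());
          continue;
        }
        allocated++;
        launch(c);
      }
      Thread.sleep(200); // pace the heartbeats
    }
  }

  static void launch(Container c) {
    // hypothetical: start the shell command in the container
  }
}
{code}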
[jira] [Commented] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960776#comment-13960776 ] Hadoop QA commented on YARN-1903: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12638788/YARN-1903.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3513//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3513//console This message is automatically generated. > Killing Container on NEW and LOCALIZING will result in exitCode and > diagnostics not set > --- > > Key: YARN-1903 > URL: https://issues.apache.org/jira/browse/YARN-1903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1903.1.patch > > > The container status after stopping container is not expected. > {code} > java.lang.AssertionError: 4: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960541#comment-13960541 ] Jason Lowe commented on YARN-1769: -- The patch no longer applies cleanly after YARN-1512. Other comments on the patch: - Nit: In LeafQueue.assignToQueue we could cache Resource.add(usedResources, required) in a local when we're computing potentialNewCapacity so we don't have to recompute it as part of the potentialNewWithoutReservedCapacity computation - LeafQueue.assignToQueue and LeafQueue.assignToUser don't seem to need the new priority argument, and therefore LeafQueue.checkLimitsToReserve wouldn't seem to need it either once those others are updated. - Should FiCaSchedulerApp getAppToUnreserve really be called getNodeIdToUnreserve or getNodeToUnreserve, since it's returning a node ID rather than an app? - In LeafQueue.findNodeToUnreserve, isn't it kinda bad if the app thinks it has reservations on the node but the scheduler doesn't know about it? Wondering if the bookkeeping is messed up at that point, so something a bit more than debug would be an appropriate log level, and if further fixup is needed. - LeafQueue.findNodeToUnreserve is adjusting the headroom when it unreserves, but I don't see other unreservations doing a similar calculation. Wondering if this fixup is something that should have been in completedContainer or needs to be done elsewhere? I could easily be missing something here but asking just in case other unreservation situations also need to have the headroom fixed. - LeafQueue.assignContainer uses the much more expensive scheduler.getConfiguration().getReservationContinueLook() when it should be able to use the reservationsContinueLooking member instead. - LeafQueue.getReservationContinueLooking should be package private - Nit: LeafQueue.assignContainer has some reformatting of the log message after the "// Inform the node" comment which was clearer to read/maintain before since the label and the value were always on a line by themselves. Same goes for the "Reserved container" log towards the end of the method. - Ultra-Nit: ParentQueue.setupQueueConfig's log message should have the reservationsContinueLooking on the previous line to match the style of other label/value pairs in the log message. - ParentQueue.getReservationContinueLooking should be package private. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved, it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
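Editor's note: the first nit above, caching the sum, looks like this in sketch form (variable names follow the comment; the reserved argument and the surrounding method are assumptions, and Resources.add/subtract from the YARN resource utilities stand in for the reviewer's Resource.add shorthand):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class CapacityCheckSketch {
  static void computePotentials(Resource usedResources, Resource required,
      Resource reserved) {
    Resource sum = Resources.add(usedResources, required); // computed once
    Resource potentialNewCapacity = sum;
    Resource potentialNewWithoutReservedCapacity =
        Resources.subtract(sum, reserved); // reuses the cached sum
    // ...both would then be checked against the queue's limits...
  }
}
{code}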
[jira] [Updated] (YARN-1898) Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are redirecting to Active RM
[ https://issues.apache.org/jira/browse/YARN-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1898: Attachment: YARN-1898.addendum.patch Submitting the same patch again to kick off Jenkins > Standby RM's conf, stacks, logLevel, metrics, jmx and logs links are > redirecting to Active RM > - > > Key: YARN-1898 > URL: https://issues.apache.org/jira/browse/YARN-1898 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Yesha Vora >Assignee: Xuan Gong > Fix For: 2.4.1 > > Attachments: YARN-1898.1.patch, YARN-1898.2.patch, YARN-1898.3.patch, > YARN-1898.addendum.patch, YARN-1898.addendum.patch > > > Standby RM links /conf, /stacks, /logLevel, /metrics, /jmx is redirected to > Active RM. > It should not be redirected to Active RM -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1903: -- Attachment: YARN-1903.1.patch Upload a patch to fix these issues > Killing Container on NEW and LOCALIZING will result in exitCode and > diagnostics not set > --- > > Key: YARN-1903 > URL: https://issues.apache.org/jira/browse/YARN-1903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-1903.1.patch > > > The container status after stopping container is not expected. > {code} > java.lang.AssertionError: 4: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1901) All tasks restart during RM failover on Hive
[ https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu resolved YARN-1901. --- Resolution: Duplicate > All tasks restart during RM failover on Hive > > > Key: YARN-1901 > URL: https://issues.apache.org/jira/browse/YARN-1901 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Fengdong Yu > > I built from trunk, and configured RM HA, then I submitted a Hive job. > There are 11 maps in total, and I stopped the active RM when 6 maps finished, > but Hive shows me all map tasks restart again. This conflicts with the > design description. > job progress: > {code} > 2014-03-31 18:44:14,088 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 713.84 sec > 2014-03-31 18:44:15,128 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 722.83 sec > 2014-03-31 18:44:16,160 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 731.95 sec > 2014-03-31 18:44:17,191 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 744.17 sec > 2014-03-31 18:44:18,220 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 756.22 sec > 2014-03-31 18:44:19,250 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 762.4 > sec > 2014-03-31 18:44:20,281 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 774.64 sec > 2014-03-31 18:44:21,306 Stage-1 map = 70%, reduce = 0%, Cumulative CPU > 786.49 sec > 2014-03-31 18:44:22,334 Stage-1 map = 70%, reduce = 0%, Cumulative CPU > 792.59 sec > 2014-03-31 18:44:23,363 Stage-1 map = 73%, reduce = 0%, Cumulative CPU > 807.58 sec > 2014-03-31 18:44:24,392 Stage-1 map = 77%, reduce = 0%, Cumulative CPU > 815.96 sec > 2014-03-31 18:44:25,416 Stage-1 map = 80%, reduce = 0%, Cumulative CPU > 823.83 sec > 2014-03-31 18:44:26,443 Stage-1 map = 80%, reduce = 0%, Cumulative CPU > 826.84 sec > 2014-03-31 18:44:27,472 Stage-1 map = 82%, reduce = 0%, Cumulative CPU > 832.16 sec > 2014-03-31 18:44:28,501 Stage-1 map = 84%, reduce = 0%, Cumulative CPU > 839.73 sec > 2014-03-31 18:44:29,531 Stage-1 map = 86%, reduce = 0%, Cumulative CPU > 844.45 sec > 2014-03-31 18:44:30,564 Stage-1 map = 82%, reduce = 0%, Cumulative CPU > 760.34 sec > 2014-03-31 18:44:31,728 Stage-1 map = 0%, reduce = 0% > 2014-03-31 18:45:06,918 Stage-1 map = 2%, reduce = 0%, Cumulative CPU > 213.81 sec > 2014-03-31 18:45:07,952 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 216.83 > sec > 2014-03-31 18:45:08,979 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 229.15 > sec > 2014-03-31 18:45:10,007 Stage-1 map = 11%, reduce = 0%, Cumulative CPU > 244.42 sec > 2014-03-31 18:45:11,040 Stage-1 map = 14%, reduce = 0%, Cumulative CPU > 247.31 sec > 2014-03-31 18:45:12,072 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 259.5 > sec > 2014-03-31 18:45:13,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 274.72 sec > 2014-03-31 18:45:14,135 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 280.76 sec > 2014-03-31 18:45:15,170 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 292.9 > sec > 2014-03-31 18:45:16,202 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 305.16 sec > 2014-03-31 18:45:17,233 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 314.21 sec > 2014-03-31 18:45:18,264 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 323.34 sec > 2014-03-31 18:45:19,294 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 335.6 sec > 2014-03-31 18:45:20,325 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 344.71 sec > 2014-03-31 18:45:21,355 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 353.8 > sec > 2014-03-31 18:45:22,385 Stage-1 map = 23%, reduce = 0%,
Cumulative CPU > 366.06 sec > 2014-03-31 18:45:23,415 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 375.2 > sec > 2014-03-31 18:45:24,449 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 384.28 sec > {code} > I am using hive-0.12.0, and ZKRMStateRoot as RM store class. Hive using a > simple external table(only one column). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive
[ https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960515#comment-13960515 ] Fengdong Yu commented on YARN-1901: --- Yes, it's an exact duplicate. Thanks, I've closed it. > All tasks restart during RM failover on Hive > > > Key: YARN-1901 > URL: https://issues.apache.org/jira/browse/YARN-1901 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Fengdong Yu > > I built from trunk, and configured RM HA, then I submitted a Hive job. > There are 11 maps in total, and I stopped the active RM when 6 maps finished, > but Hive shows me all map tasks restart again. This conflicts with the > design description. > job progress: > {code} > 2014-03-31 18:44:14,088 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 713.84 sec > 2014-03-31 18:44:15,128 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 722.83 sec > 2014-03-31 18:44:16,160 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 731.95 sec > 2014-03-31 18:44:17,191 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 744.17 sec > 2014-03-31 18:44:18,220 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 756.22 sec > 2014-03-31 18:44:19,250 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 762.4 > sec > 2014-03-31 18:44:20,281 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 774.64 sec > 2014-03-31 18:44:21,306 Stage-1 map = 70%, reduce = 0%, Cumulative CPU > 786.49 sec > 2014-03-31 18:44:22,334 Stage-1 map = 70%, reduce = 0%, Cumulative CPU > 792.59 sec > 2014-03-31 18:44:23,363 Stage-1 map = 73%, reduce = 0%, Cumulative CPU > 807.58 sec > 2014-03-31 18:44:24,392 Stage-1 map = 77%, reduce = 0%, Cumulative CPU > 815.96 sec > 2014-03-31 18:44:25,416 Stage-1 map = 80%, reduce = 0%, Cumulative CPU > 823.83 sec > 2014-03-31 18:44:26,443 Stage-1 map = 80%, reduce = 0%, Cumulative CPU > 826.84 sec > 2014-03-31 18:44:27,472 Stage-1 map = 82%, reduce = 0%, Cumulative CPU > 832.16 sec > 2014-03-31 18:44:28,501 Stage-1 map = 84%, reduce = 0%, Cumulative CPU > 839.73 sec > 2014-03-31 18:44:29,531 Stage-1 map = 86%, reduce = 0%, Cumulative CPU > 844.45 sec > 2014-03-31 18:44:30,564 Stage-1 map = 82%, reduce = 0%, Cumulative CPU > 760.34 sec > 2014-03-31 18:44:31,728 Stage-1 map = 0%, reduce = 0% > 2014-03-31 18:45:06,918 Stage-1 map = 2%, reduce = 0%, Cumulative CPU > 213.81 sec > 2014-03-31 18:45:07,952 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 216.83 > sec > 2014-03-31 18:45:08,979 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 229.15 > sec > 2014-03-31 18:45:10,007 Stage-1 map = 11%, reduce = 0%, Cumulative CPU > 244.42 sec > 2014-03-31 18:45:11,040 Stage-1 map = 14%, reduce = 0%, Cumulative CPU > 247.31 sec > 2014-03-31 18:45:12,072 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 259.5 > sec > 2014-03-31 18:45:13,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 274.72 sec > 2014-03-31 18:45:14,135 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 280.76 sec > 2014-03-31 18:45:15,170 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 292.9 > sec > 2014-03-31 18:45:16,202 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 305.16 sec > 2014-03-31 18:45:17,233 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 314.21 sec > 2014-03-31 18:45:18,264 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 323.34 sec > 2014-03-31 18:45:19,294 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 335.6 sec > 2014-03-31 18:45:20,325 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 344.71 sec > 2014-03-31 18:45:21,355 Stage-1 map = 23%, reduce = 0%, Cumulative CPU
353.8 > sec > 2014-03-31 18:45:22,385 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 366.06 sec > 2014-03-31 18:45:23,415 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 375.2 > sec > 2014-03-31 18:45:24,449 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 384.28 sec > {code} > I am using hive-0.12.0, and ZKRMStateRoot as RM store class. Hive using a > simple external table(only one column). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1903) Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1903: -- Summary: Killing Container on NEW and LOCALIZING will result in exitCode and diagnostics not set (was: TestNMClient fails occasionally) > Killing Container on NEW and LOCALIZING will result in exitCode and > diagnostics not set > --- > > Key: YARN-1903 > URL: https://issues.apache.org/jira/browse/YARN-1903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > The container status after stopping container is not expected. > {code} > java.lang.AssertionError: 4: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1903) TestNMClient fails occasionally
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960480#comment-13960480 ] Zhijie Shen commented on YARN-1903: --- I did more investigation. Instead of a test failure, it sounds more like a bug in the container life cycle to me: 1. If a container is killed on NEW, the exit code and diagnostics will never be set. 2. If a container is killed on LOCALIZING, the exit code will never be set. > TestNMClient fails occasionally > --- > > Key: YARN-1903 > URL: https://issues.apache.org/jira/browse/YARN-1903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > The container status after stopping container is not expected. > {code} > java.lang.AssertionError: 4: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1903) TestNMClient fails occasionally
[ https://issues.apache.org/jira/browse/YARN-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960418#comment-13960418 ] Zhijie Shen commented on YARN-1903: --- I found the following log: {code} 2014-04-04 05:08:01,361 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatusInternal(785)) - Returning ContainerStatus: [ContainerId: container_1396613275302_0001_01_04, State: RUNNING, Diagnostics: , ExitStatus: -1000, ] 2014-04-04 05:08:01,365 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(718)) - Stopping container with container Id: container_1396613275302_0001_01_04 2014-04-04 05:08:01,366 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=jenkins IP=10.79.62.28 OPERATION=Stop Container RequestTARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1396613275302_0001 CONTAINERID=container_1396613275302_0001_01_04 2014-04-04 05:08:01,387 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isEnabled(169)) - Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread 2014-04-04 05:08:01,387 INFO containermanager.AuxServices (AuxServices.java:handle(175)) - Got event CONTAINER_STOP for appId application_1396613275302_0001 2014-04-04 05:08:01,389 INFO application.Application (ApplicationImpl.java:transition(296)) - Adding container_1396613275302_0001_01_04 to application application_1396613275302_0001 2014-04-04 05:08:01,389 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=jenkins OPERATION=Container Finished - Killed TARGET=ContainerImplRESULT=SUCCESS APPID=application_1396613275302_0001 CONTAINERID=container_1396613275302_0001_01_04 2014-04-04 05:08:01,389 INFO container.Container (ContainerImpl.java:handle(884)) - Container container_1396613275302_0001_01_04 transitioned from NEW to DONE 2014-04-04 05:08:01,389 INFO application.Application (ApplicationImpl.java:transition(339)) - Removing container_1396613275302_0001_01_04 from application application_1396613275302_0001 2014-04-04 05:08:01,390 INFO util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(182)) - ProcfsBasedProcessTree currently is supported only on Linux. 2014-04-04 05:08:01,392 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(321)) - container_1396613275302_0001_01_04 Container Transitioned from ACQUIRED to RUNNING 2014-04-04 05:08:01,393 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatusInternal(771)) - Getting container-status for container_1396613275302_0001_01_04 2014-04-04 05:08:01,393 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatusInternal(785)) - Returning ContainerStatus: [ContainerId: container_1396613275302_0001_01_04, State: COMPLETE, Diagnostics: , ExitStatus: -1000, ] {code} When the kill event is received, the container is still at NEW; it is moved to DONE through ContainerDoneTransition, which won't set the kill-related exit code and diagnostics. > TestNMClient fails occasionally > --- > > Key: YARN-1903 > URL: https://issues.apache.org/jira/browse/YARN-1903 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > The container status after stopping container is not expected.
> {code} > java.lang.AssertionError: 4: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) > at > org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
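Editor's note: the analysis above suggests the shape of a fix, sketched below. This is NOT the YARN-1903 patch: it assumes a transition nested inside ContainerImpl (like the existing ones, so it can touch the exitCode and diagnostics fields), and the choice of ContainerExitStatus.ABORTED is an assumption.
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.state.SingleArcTransition;

// Assumed to live inside ContainerImpl, alongside the existing transitions:
static class KilledBeforeLaunchTransition
    implements SingleArcTransition<ContainerImpl, ContainerEvent> {
  @Override
  public void transition(ContainerImpl container, ContainerEvent event) {
    ContainerKillEvent killEvent = (ContainerKillEvent) event;
    // Record why the container ended instead of leaving the -1000 default:
    container.exitCode = ContainerExitStatus.ABORTED;
    container.diagnostics.append(killEvent.getDiagnostic()).append("\n");
    container.diagnostics.append("Container killed before it was launched.\n");
  }
}
{code}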
[jira] [Created] (YARN-1903) TestNMClient fails occasionally
Zhijie Shen created YARN-1903: - Summary: TestNMClient fails occasionally Key: YARN-1903 URL: https://issues.apache.org/jira/browse/YARN-1903 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen The container status after stopping container is not expected. {code} java.lang.AssertionError: 4: at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:382) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:346) at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1872) TestDistributedShell occasionally fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960357#comment-13960357 ] Zhijie Shen commented on YARN-1872: --- bq. After the DistributedShell AM requested numTotalContainers containers, RM may allocate more than that. [~zhiguohong], thanks for working on the test failure. Do you know why RM is likely to allocate more containers than the AM requested? Is it related to what YARN-1902 described? > TestDistributedShell occasionally fails in trunk > > > Key: YARN-1872 > URL: https://issues.apache.org/jira/browse/YARN-1872 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Hong Zhiguo >Priority: Blocker > Attachments: TestDistributedShell.out, YARN-1872.patch > > > From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console : > TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and > TestDistributedShell#testDSShell timed out. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1872) TestDistributedShell occasionally fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1872: -- Priority: Blocker (was: Major) Target Version/s: 2.4.1 Labels: (was: patch) > TestDistributedShell occasionally fails in trunk > > > Key: YARN-1872 > URL: https://issues.apache.org/jira/browse/YARN-1872 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Hong Zhiguo >Priority: Blocker > Attachments: TestDistributedShell.out, YARN-1872.patch > > > From https://builds.apache.org/job/Hadoop-Yarn-trunk/520/console : > TestDistributedShell#testDSShellWithCustomLogPropertyFile failed and > TestDistributedShell#testDSShell timed out. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails
[ https://issues.apache.org/jira/browse/YARN-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960281#comment-13960281 ] Hudson commented on YARN-1837: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5458 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5458/]) YARN-1837. Fixed TestMoveApplication#testMoveRejectedByScheduler failure. Contributed by Hong Zhiguo (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1584862) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java > TestMoveApplication.testMoveRejectedByScheduler randomly fails > -- > > Key: YARN-1837 > URL: https://issues.apache.org/jira/browse/YARN-1837 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: Tsuyoshi OZAWA >Assignee: Hong Zhiguo > Fix For: 2.4.1 > > Attachments: YARN-1837.patch > > > TestMoveApplication#testMoveRejectedByScheduler fails because of > NullPointerException. It looks caused by unhandled exception handling at > server-side. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails
[ https://issues.apache.org/jira/browse/YARN-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960259#comment-13960259 ] Jian He commented on YARN-1837: --- One more observation is that move is allowed at the submitted state; not sure whether that's expected or not. Not relevant to this patch, though. Checking this in. > TestMoveApplication.testMoveRejectedByScheduler randomly fails > -- > > Key: YARN-1837 > URL: https://issues.apache.org/jira/browse/YARN-1837 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: Tsuyoshi OZAWA >Assignee: Hong Zhiguo > Attachments: YARN-1837.patch > > > TestMoveApplication#testMoveRejectedByScheduler fails because of > NullPointerException. It looks caused by unhandled exception handling at > server-side. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1837) TestMoveApplication.testMoveRejectedByScheduler randomly fails
[ https://issues.apache.org/jira/browse/YARN-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960255#comment-13960255 ] Jian He commented on YARN-1837: --- looks good to me, +1 > TestMoveApplication.testMoveRejectedByScheduler randomly fails > -- > > Key: YARN-1837 > URL: https://issues.apache.org/jira/browse/YARN-1837 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: Tsuyoshi OZAWA >Assignee: Hong Zhiguo > Attachments: YARN-1837.patch > > > TestMoveApplication#testMoveRejectedByScheduler fails because of > NullPointerException. It looks caused by unhandled exception handling at > server-side. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sietse T. Au updated YARN-1902: --- Description: Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers are requested in both scenarios, but that only in the second scenario, the correct behavior is observed. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. The patch includes a test in which scenario one is tested. was: Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers are requested in both scenarios, but that only in the second scenario, the correct behavior is observed. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. > Allocation of too many containers when a second request is done with the same > resource capability > - > > Key: YARN-1902 > URL: https://issues.apache.org/jira/browse/YARN-1902 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.2.0, 2.3.0 >Reporter: Sietse T. Au > Labels: patch > Attachments: YARN-1902.patch > > > Regarding AMRMClientImpl > Scenario 1: > Given a ContainerRequest x with Resource y, when addContainerRequest is > called z times with x, allocate is called and at least one of the z allocated > containers is started, then if another addContainerRequest call is done and > subsequently an allocate call to the RM, (z+1) containers will be allocated, > where 1 container is expected. > Scenario 2: > No containers are started between the allocate calls. > Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers > are requested in both scenarios, but that only in the second scenario, the > correct behavior is observed.
> Looking at the implementation I have found that this (z+1) request is caused > by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any > information about whether a request has been sent to the RM yet or not. > There are workarounds for this, such as releasing the excess containers > received. > The solution implemented is to initialize a new ResourceRequest in > ResourceRequestInfo when a request has been successfully sent to the RM. > The patch includes a test in which scenario one is tested. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sietse T. Au updated YARN-1902: --- Description: Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers are requested in both scenarios, but that only in the second scenario, the correct behavior is observed. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. was: Regarding AMRMClientImpl Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: This behavior does not occur when no containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers are requested in both scenarios, but that only in the second scenario, the correct behavior is observed. Looking at the implementation I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. > Allocation of too many containers when a second request is done with the same > resource capability > - > > Key: YARN-1902 > URL: https://issues.apache.org/jira/browse/YARN-1902 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.2.0, 2.3.0 >Reporter: Sietse T. Au > Labels: patch > Attachments: YARN-1902.patch > > > Regarding AMRMClientImpl > Scenario 1: > Given a ContainerRequest x with Resource y, when addContainerRequest is > called z times with x, allocate is called and at least one of the z allocated > containers is started, then if another addContainerRequest call is done and > subsequently an allocate call to the RM, (z+1) containers will be allocated, > where 1 container is expected. > Scenario 2: > No containers are started between the allocate calls. > Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers > are requested in both scenarios, but that only in the second scenario, the > correct behavior is observed. > Looking at the implementation I have found that this (z+1) request is caused > by the structure of the remoteRequestsTable.
The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any > information about whether a request has been sent to the RM yet or not. > There are workarounds for this, such as releasing the excess containers > received. > The solution implemented is to initialize a new ResourceRequest in > ResourceRequestInfo when a request has been successfully sent to the RM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sietse T. Au updated YARN-1902: --- Attachment: YARN-1902.patch > Allocation of too many containers when a second request is done with the same > resource capability > - > > Key: YARN-1902 > URL: https://issues.apache.org/jira/browse/YARN-1902 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.2.0, 2.3.0 >Reporter: Sietse T. Au > Labels: patch > Attachments: YARN-1902.patch > > > Regarding AMRMClientImpl > Scenario 1: > Given a ContainerRequest x with Resource y, when addContainerRequest is > called z times with x, allocate is called and at least one of the z allocated > containers is started, then if another addContainerRequest call is done and > subsequently an allocate call to the RM, (z+1) containers will be allocated, > where 1 container is expected. > Scenario 2: > This behavior does not occur when no containers are started between the > allocate calls. > Analyzing debug logs of the AMRMClientImpl, I have found that indeed (z+1) containers > are requested in both scenarios, but that only in the second scenario, the > correct behavior is observed. > Looking at the implementation I have found that this (z+1) request is caused > by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any > information about whether a request has been sent to the RM yet or not. > There are workarounds for this, such as releasing the excess containers > received. > The solution implemented is to initialize a new ResourceRequest in > ResourceRequestInfo when a request has been successfully sent to the RM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive
[ https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960026#comment-13960026 ] Jason Lowe commented on YARN-1901: -- This appears to be a duplicate of HIVE-6638. As [~ozawa] mentioned, AMs are restarted when the RM restarts until YARN-556 is addressed. When an AM restarts, it is not automatically the case that completed tasks will be recovered -- it must be supported by the output committer. HIVE-6638 is updating Hive's OutputCommitter so it can support task recovery upon AM restart. > All tasks restart during RM failover on Hive > > > Key: YARN-1901 > URL: https://issues.apache.org/jira/browse/YARN-1901 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Fengdong Yu > > I built from trunk, and configured RM HA, then I submitted a Hive job. > There are 11 maps in total, and I stopped the active RM when 6 maps finished, > but Hive shows me all map tasks restart again. This conflicts with the > design description. > job progress: > {code} > 2014-03-31 18:44:14,088 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 713.84 sec > 2014-03-31 18:44:15,128 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 722.83 sec > 2014-03-31 18:44:16,160 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 731.95 sec > 2014-03-31 18:44:17,191 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 744.17 sec > 2014-03-31 18:44:18,220 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 756.22 sec > 2014-03-31 18:44:19,250 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 762.4 > sec > 2014-03-31 18:44:20,281 Stage-1 map = 68%, reduce = 0%, Cumulative CPU > 774.64 sec > 2014-03-31 18:44:21,306 Stage-1 map = 70%, reduce = 0%, Cumulative CPU > 786.49 sec > 2014-03-31 18:44:22,334 Stage-1 map = 70%, reduce = 0%, Cumulative CPU > 792.59 sec > 2014-03-31 18:44:23,363 Stage-1 map = 73%, reduce = 0%, Cumulative CPU > 807.58 sec > 2014-03-31 18:44:24,392 Stage-1 map = 77%, reduce = 0%, Cumulative CPU > 815.96 sec > 2014-03-31 18:44:25,416 Stage-1 map = 80%, reduce = 0%, Cumulative CPU > 823.83 sec > 2014-03-31 18:44:26,443 Stage-1 map = 80%, reduce = 0%, Cumulative CPU > 826.84 sec > 2014-03-31 18:44:27,472 Stage-1 map = 82%, reduce = 0%, Cumulative CPU > 832.16 sec > 2014-03-31 18:44:28,501 Stage-1 map = 84%, reduce = 0%, Cumulative CPU > 839.73 sec > 2014-03-31 18:44:29,531 Stage-1 map = 86%, reduce = 0%, Cumulative CPU > 844.45 sec > 2014-03-31 18:44:30,564 Stage-1 map = 82%, reduce = 0%, Cumulative CPU > 760.34 sec > 2014-03-31 18:44:31,728 Stage-1 map = 0%, reduce = 0% > 2014-03-31 18:45:06,918 Stage-1 map = 2%, reduce = 0%, Cumulative CPU > 213.81 sec > 2014-03-31 18:45:07,952 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 216.83 > sec > 2014-03-31 18:45:08,979 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 229.15 > sec > 2014-03-31 18:45:10,007 Stage-1 map = 11%, reduce = 0%, Cumulative CPU > 244.42 sec > 2014-03-31 18:45:11,040 Stage-1 map = 14%, reduce = 0%, Cumulative CPU > 247.31 sec > 2014-03-31 18:45:12,072 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 259.5 > sec > 2014-03-31 18:45:13,105 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 274.72 sec > 2014-03-31 18:45:14,135 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 280.76 sec > 2014-03-31 18:45:15,170 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 292.9 > sec > 2014-03-31 18:45:16,202 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 305.16 sec > 2014-03-31 18:45:17,233 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 314.21 sec >
2014-03-31 18:45:18,264 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 323.34 sec > 2014-03-31 18:45:19,294 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 335.6 sec > 2014-03-31 18:45:20,325 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 344.71 sec > 2014-03-31 18:45:21,355 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 353.8 > sec > 2014-03-31 18:45:22,385 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 366.06 sec > 2014-03-31 18:45:23,415 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 375.2 > sec > 2014-03-31 18:45:24,449 Stage-1 map = 23%, reduce = 0%, Cumulative CPU > 384.28 sec > {code} > I am using hive-0.12.0, and ZKRMStateRoot as RM store class. Hive using a > simple external table(only one column). -- This message was sent by Atlassian JIRA (v6.2#6252)
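To make the recovery contract above concrete, here is a minimal sketch of the two hooks an OutputCommitter must implement before a restarted AM can preserve completed tasks. The method names (isRecoverySupported, recoverTask) are the real hooks on org.apache.hadoop.mapreduce.OutputCommitter; the RecoverableCommitter class and its bodies are illustrative assumptions, not the actual HIVE-6638 patch:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

// Illustrative committer: FileOutputCommitter already implements both
// hooks, so this subclass only makes the recovery contract explicit.
public class RecoverableCommitter extends FileOutputCommitter {

  public RecoverableCommitter(Path outputPath, TaskAttemptContext context)
      throws IOException {
    super(outputPath, context);
  }

  // A restarted AM checks this before deciding whether completed tasks
  // can be kept; returning false forces every task to re-run.
  @Override
  public boolean isRecoverySupported() {
    return true;
  }

  // Called once per completed task on AM restart; it must locate the
  // output committed by the previous attempt and carry it forward.
  @Override
  public void recoverTask(TaskAttemptContext context) throws IOException {
    super.recoverTask(context);
  }
}
{code}
A committer that does not opt in to this contract behaves exactly as reported above: the restarted AM re-runs all tasks, which matches Jason Lowe's note that HIVE-6638 updates Hive's committer to add recovery support.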
[jira] [Created] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
Sietse T. Au created YARN-1902:
----------------------------------

             Summary: Allocation of too many containers when a second request is done with the same resource capability
                 Key: YARN-1902
                 URL: https://issues.apache.org/jira/browse/YARN-1902
             Project: Hadoop YARN
          Issue Type: Bug
          Components: client
    Affects Versions: 2.3.0, 2.2.0
            Reporter: Sietse T. Au

Regarding AMRMClientImpl:

Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called, and at least one of the z allocated containers is started, then if another addContainerRequest call is made and subsequently an allocate call is sent to the RM, (z+1) containers will be allocated, where 1 container is expected.

Scenario 2: This behavior does not occur when no containers are started between the allocate calls.

Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) containers are indeed requested in both scenarios, but only in the second scenario is the correct behavior observed. Looking at the implementation, I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of its nested Map structure is that ResourceRequestInfo does not hold any information about whether a request has already been sent to the RM. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM.
--
This message was sent by Atlassian JIRA (v6.2#6252)
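For readers trying to reproduce the report, here is a minimal sketch of Scenario 1 against the public AMRMClient API. The AMRMClient, ContainerRequest, Resource, and Priority calls are the real 2.x client API; the value of z, the host/port arguments, and the container-launch step are illustrative assumptions, and the code assumes it runs inside a registered application attempt with valid AM credentials:
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class Scenario1 {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> client = AMRMClient.createAMRMClient();
    client.init(new YarnConfiguration());
    client.start();
    client.registerApplicationMaster("localhost", 0, "");

    // z identical requests: same capability, same priority.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    int z = 3;
    for (int i = 0; i < z; i++) {
      client.addContainerRequest(
          new ContainerRequest(capability, null, null, priority));
    }
    List<Container> first = client.allocate(0.1f).getAllocatedContainers();
    // ... launch at least one of the z allocated containers here ...

    // One more identical request: per this report, subsequent allocate()
    // calls eventually hand back z+1 containers where 1 is expected.
    client.addContainerRequest(
        new ContainerRequest(capability, null, null, priority));
    List<Container> second = client.allocate(0.2f).getAllocatedContainers();
    System.out.println("expected 1, got " + second.size());
  }
}
{code}
The fix described above would make ResourceRequestInfo distinguish asks that have already been sent from new ones, so the final addContainerRequest adds only one outstanding container to the request table.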
[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive
[ https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959815#comment-13959815 ] Fengdong Yu commented on YARN-1901:
---
Hi [~ozawa],
Can you search the yarn-dev mailing list? I sent a mail about this issue.
This issue only affects Hive jobs. General MR jobs work well (only unfinished tasks restart; finished tasks are not re-run).
> All tasks restart during RM failover on Hive
> --
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Fengdong Yu
>
--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive
[ https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959731#comment-13959731 ] Tsuyoshi OZAWA commented on YARN-1901:
--
Sorry, I made a typo that may confuse you. The current Hadoop supports: "RM can be able to continue running existing applications on cluster after the RM has been restarted. Clients should not have to re-submit currently running/submitted apps." Work-preserving restart is under development in YARN-556.
> All tasks restart during RM failover on Hive
> --
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Fengdong Yu
>
--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1901) All tasks restart during RM failover on Hive
[ https://issues.apache.org/jira/browse/YARN-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959726#comment-13959726 ] Tsuyoshi OZAWA commented on YARN-1901:
--
Thank you for reporting, [~azuryy]. Currently, the AM restarts after the RM restarts. To address this problem, there is an ongoing discussion under YARN-556. The current Hadoop supports: "RM can be able to continue running existing applications on cluster after the RM has been restarted." For more detail, please see the design note on YARN-128: https://issues.apache.org/jira/secure/attachment/12552867/RMRestartPhase1.pdf
> All tasks restart during RM failover on Hive
> --
>
> Key: YARN-1901
> URL: https://issues.apache.org/jira/browse/YARN-1901
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Fengdong Yu
>
--
This message was sent by Atlassian JIRA (v6.2#6252)
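For context, the (non-work-preserving) RM recovery discussed in this thread is controlled by a small set of configuration properties. A minimal sketch follows, using what I believe are the standard yarn-site.xml keys for the ZooKeeper-backed store the reporter mentions; the ZooKeeper quorum address is an illustrative placeholder:
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Minimal sketch: enable RM recovery with the ZooKeeper-backed state
// store. Property keys are the standard yarn-site.xml names; the
// quorum address is illustrative only.
public class RmRecoveryConfig {
  public static YarnConfiguration create() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
    conf.set("yarn.resourcemanager.store.class",
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore");
    conf.set("yarn.resourcemanager.zk-address", "zk1:2181,zk2:2181,zk3:2181");
    return conf;
  }
}
{code}
With these set but without YARN-556, the RM recovers the application and restarts its AM; whether the completed tasks survive that restart is then up to the job's output committer, as discussed above.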