[jira] [Updated] (YARN-7527) Over-allocate node resource in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7527: --- Attachment: YARN-7527.001.patch Attaching the initial patch for review. > Over-allocate node resource in async-scheduling mode of CapacityScheduler > - > > Key: YARN-7527 > URL: https://issues.apache.org/jira/browse/YARN-7527 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7527.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, node resource may be > over-allocated since the node resource check is ignored. > {{FiCaSchedulerApp#commonCheckContainerAllocation}} will check whether this > node has enough available resource for this proposal and return the check result > (true/false), but this result is ignored in {{CapacityScheduler#accept}} as > below. > {noformat} > commonCheckContainerAllocation(allocation, schedulerContainer); > {noformat} > If {{FiCaSchedulerApp#commonCheckContainerAllocation}} returns false, > {{CapacityScheduler#accept}} should also return false as below: > {noformat} > if (!commonCheckContainerAllocation(allocation, schedulerContainer)) { > return false; > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7527) Over-allocate node resource in async-scheduling mode of CapacityScheduler
Tao Yang created YARN-7527: -- Summary: Over-allocate node resource in async-scheduling mode of CapacityScheduler Key: YARN-7527 URL: https://issues.apache.org/jira/browse/YARN-7527 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang Assignee: Tao Yang Currently in async-scheduling mode of CapacityScheduler, node resource may be over-allocated since the node resource check is ignored. {{FiCaSchedulerApp#commonCheckContainerAllocation}} will check whether this node has enough available resource for this proposal and return the check result (true/false), but this result is ignored in {{CapacityScheduler#accept}} as below. {noformat} commonCheckContainerAllocation(allocation, schedulerContainer); {noformat} If {{FiCaSchedulerApp#commonCheckContainerAllocation}} returns false, {{CapacityScheduler#accept}} should also return false as below: {noformat} if (!commonCheckContainerAllocation(allocation, schedulerContainer)) { return false; } {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7525) Incorrect query parameters in cluster nodes REST API document
[ https://issues.apache.org/jira/browse/YARN-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7525: --- Fix Version/s: (was: 2.9.1) (was: 3.0.0-alpha4) > Incorrect query parameters in cluster nodes REST API document > - > > Key: YARN-7525 > URL: https://issues.apache.org/jira/browse/YARN-7525 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7525.001.patch > > > Recently we used the cluster nodes REST API and found that the query parameters (state > and healthy) in the document do not exist. > The query parameters currently in the document are: > {noformat} > * state - the state of the node > * healthy - true or false > {noformat} > The correct query parameter should be: > {noformat} > * states - the states of the node, specified as a comma-separated list. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7525) Incorrect query parameters in cluster nodes REST API document
[ https://issues.apache.org/jira/browse/YARN-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7525: --- Attachment: YARN-7525.001.patch > Incorrect query parameters in cluster nodes REST API document > - > > Key: YARN-7525 > URL: https://issues.apache.org/jira/browse/YARN-7525 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7525.001.patch > > > Recently we used the cluster nodes REST API and found that the query parameters (state > and healthy) in the document do not exist. > The query parameters currently in the document are: > {noformat} > * state - the state of the node > * healthy - true or false > {noformat} > The correct query parameter should be: > {noformat} > * states - the states of the node, specified as a comma-separated list. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7525) Incorrect query parameters in cluster nodes REST API document
Tao Yang created YARN-7525: -- Summary: Incorrect query parameters in cluster nodes REST API document Key: YARN-7525 URL: https://issues.apache.org/jira/browse/YARN-7525 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang Assignee: Tao Yang Priority: Minor Recently we used the cluster nodes REST API and found that the query parameters (state and healthy) in the document do not exist. The query parameters currently in the document are: {noformat} * state - the state of the node * healthy - true or false {noformat} The correct query parameter should be: {noformat} * states - the states of the node, specified as a comma-separated list. {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
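For illustration, a request using the corrected parameter would look like the following (the RM address and the node states shown here are placeholders, not taken from the issue):
{noformat}
GET http://<rm-http-address:port>/ws/v1/cluster/nodes?states=RUNNING,UNHEALTHY
{noformat}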
[jira] [Commented] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode
[ https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256364#comment-16256364 ] Tao Yang commented on YARN-7508: Thanks [~sunilg] and [~bibinchundatt] for your review and comments. Other instances of similar usage seem fine since they can guarantee that {{schedulerContainer.getSchedulerNode().getReservedContainer()}} is not null. > NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated > reserved proposal in async-scheduling mode > > > Key: YARN-7508 > URL: https://issues.apache.org/jira/browse/YARN-7508 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7508.001.patch > > > YARN-6678 has fixed the IllegalStateException problem, but the debug log it > added may cause an NPE when trying to print the containerId of a non-existent reserved > container on this node. Replacing > {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} > with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can > fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container
[ https://issues.apache.org/jira/browse/YARN-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7511: --- Attachment: YARN-7511.001.patch Attaching v1 patch for review. > NPE in ContainerLocalizer when localization failed for running container > > > Key: YARN-7511 > URL: https://issues.apache.org/jira/browse/YARN-7511 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7511.001.patch > > > Error log: > {noformat} > 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106) > at > java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813) > at > org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:834) > 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] > org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. > {noformat} > Reproduce this problem: > 1. Container was running and ContainerManagerImpl#localize was called for > this container > 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run and > sent out ContainerResourceFailedEvent with null LocalResourceRequest. > 3. NPE when ResourceLocalizationFailedWhileRunningTransition#transition --> > container.resourceSet.resourceLocalizationFailed(null) > I think we can fix this problem through ensuring that request is not null > before remove it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7511) NPE in ContainerLocalizer when localization failed for running container
Tao Yang created YARN-7511: -- Summary: NPE in ContainerLocalizer when localization failed for running container Key: YARN-7511 URL: https://issues.apache.org/jira/browse/YARN-7511 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang Assignee: Tao Yang Error log: {noformat} 2017-09-30 20:14:32,839 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1106) at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.resourceLocalizationFailed(ResourceSet.java:151) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:821) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceLocalizationFailedWhileRunningTransition.transition(ContainerImpl.java:813) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1335) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:95) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1372) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1365) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:834) 2017-09-30 20:14:32,842 INFO [AsyncDispatcher ShutDown handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.. {noformat} To reproduce this problem: 1. Container was running and ContainerManagerImpl#localize was called for this container 2. Localization failed in ResourceLocalizationService$LocalizerRunner#run and sent out a ContainerResourceFailedEvent with a null LocalResourceRequest. 3. NPE when ResourceLocalizationFailedWhileRunningTransition#transition --> container.resourceSet.resourceLocalizationFailed(null) I think we can fix this problem by ensuring that the request is not null before removing it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
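A minimal sketch of the proposed null check, assuming it lands in ResourceSet#resourceLocalizationFailed (the field name and surrounding logic are simplified assumptions, not the actual patch):
{code}
public void resourceLocalizationFailed(LocalResourceRequest request) {
  // The event may carry a null request when localization fails before a
  // specific resource is resolved; skip the map removal in that case to
  // avoid the NPE inside ConcurrentHashMap#remove.
  if (request == null) {
    return;
  }
  pendingResources.remove(request);
}
{code}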
[jira] [Updated] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7509: --- Attachment: YARN-7509.001.patch Attaching v1 patch for review. > AsyncScheduleThread and ResourceCommitterService are still running after RM > is transitioned to standby > -- > > Key: YARN-7509 > URL: https://issues.apache.org/jira/browse/YARN-7509 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7509.001.patch > > > After RM is transitioned to standby, AsyncScheduleThread and > ResourceCommitterService will receive an interrupt signal. When the thread is > sleeping, it will ignore the interrupt signal since InterruptedException is > caught inside and the interrupt status is cleared. > For AsyncScheduleThread, InterruptedException is caught and ignored in > CapacityScheduler#schedule. > For ResourceCommitterService, InterruptedException is caught and > ignored in ResourceCommitterService#run. > We should let the interrupt signal propagate and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
Tao Yang created YARN-7509: -- Summary: AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby Key: YARN-7509 URL: https://issues.apache.org/jira/browse/YARN-7509 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha4, 2.9.1 Reporter: Tao Yang After RM is transitioned to standby, AsyncScheduleThread and ResourceCommitterService will receive interrupt signal. When thread is sleeping, it will ignore the interrupt signal since InterruptedException is catched inside and the interrupt signal is cleared. For AsyncScheduleThread, InterruptedException was catched and ignored in CapacityScheduler#schedule. For ResourceCommitterService, InterruptedException was catched inside and ignored in ResourceCommitterService#run. We should let the interrupt signal out and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
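A generic sketch of the intended thread behavior (this is the usual interrupt-handling pattern, not the actual CapacityScheduler code): restore the interrupt status instead of swallowing it, so the run loop can observe it and exit.
{code}
while (!Thread.currentThread().isInterrupted()) {
  try {
    doScheduleOrCommit();   // hypothetical unit of work that may sleep
  } catch (InterruptedException e) {
    // Restore the interrupt status so the loop condition sees it and the
    // thread exits instead of silently continuing after RM failover.
    Thread.currentThread().interrupt();
  }
}
{code}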
[jira] [Assigned] (YARN-7509) AsyncScheduleThread and ResourceCommitterService are still running after RM is transitioned to standby
[ https://issues.apache.org/jira/browse/YARN-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang reassigned YARN-7509: -- Assignee: Tao Yang > AsyncScheduleThread and ResourceCommitterService are still running after RM > is transitioned to standby > -- > > Key: YARN-7509 > URL: https://issues.apache.org/jira/browse/YARN-7509 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4, 2.9.1 >Reporter: Tao Yang >Assignee: Tao Yang > > After RM is transitioned to standby, AsyncScheduleThread and > ResourceCommitterService will receive an interrupt signal. When the thread is > sleeping, it will ignore the interrupt signal since InterruptedException is > caught inside and the interrupt status is cleared. > For AsyncScheduleThread, InterruptedException is caught and ignored in > CapacityScheduler#schedule. > For ResourceCommitterService, InterruptedException is caught and > ignored in ResourceCommitterService#run. > We should let the interrupt signal propagate and make these threads exit. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode
[ https://issues.apache.org/jira/browse/YARN-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7508: --- Attachment: YARN-7508.001.patch Uploading v1 patch. [~sunilg], could you help review it, please? > NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated > reserved proposal in async-scheduling mode > > > Key: YARN-7508 > URL: https://issues.apache.org/jira/browse/YARN-7508 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7508.001.patch > > > YARN-6678 has fixed the IllegalStateException problem, but the debug log it > added may cause an NPE when trying to print the containerId of a non-existent reserved > container on this node. Replacing > {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} > with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can > fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7508) NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode
Tao Yang created YARN-7508: -- Summary: NPE in FiCaSchedulerApp when debug log enabled and try to commit outdated reserved proposal in async-scheduling mode Key: YARN-7508 URL: https://issues.apache.org/jira/browse/YARN-7508 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha4, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang YARN-6678 has fixed the IllegalStateException problem, but the debug log it added may cause an NPE when trying to print the containerId of a non-existent reserved container on this node. Replacing {{schedulerContainer.getSchedulerNode().getReservedContainer().getContainerId()}} with {{schedulerContainer.getSchedulerNode().getReservedContainer()}} can fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
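A sketch of the suggested change to the debug log (the message text is illustrative; the point is only that the possibly-null reserved container is no longer dereferenced):
{code}
if (LOG.isDebugEnabled()) {
  // Log the reserved container object itself; it may be null when an
  // outdated reserved proposal is committed, so it must not be dereferenced.
  LOG.debug("Node " + schedulerContainer.getSchedulerNode().getNodeID()
      + " is already reserved by another container: "
      + schedulerContainer.getSchedulerNode().getReservedContainer());
}
{code}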
[jira] [Updated] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7461: --- Attachment: YARN-7461.003.patch Updating the patch to skip the ratio calculation for resource types whose left value and right value are both zero. [~templedf], could you help review it, please? > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch, > YARN-7461.003.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
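A simplified model of the skip logic described above (left/right and numResourceTypes stand for the per-type values of the two Resource objects; this is not the exact patch code):
{code}
float ratio = 0.0f;
for (int i = 0; i < numResourceTypes; i++) {
  long lhs = left[i];
  long rhs = right[i];
  if (lhs == 0 && rhs == 0) {
    // Both values are zero for this resource type: it carries no
    // information, so skip it instead of computing 0/0 = NaN.
    continue;
  }
  // A non-zero lhs with rhs == 0 still yields INFINITY, meaning lhs can
  // never fit into rhs.
  ratio = Math.max(ratio, (float) lhs / rhs);
}
{code}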
[jira] [Updated] (YARN-7489) ConcurrentModificationException in RMAppImpl#getRMAppMetrics
[ https://issues.apache.org/jira/browse/YARN-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7489: --- Attachment: YARN-7489.001.patch > ConcurrentModificationException in RMAppImpl#getRMAppMetrics > > > Key: YARN-7489 > URL: https://issues.apache.org/jira/browse/YARN-7489 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7489.001.patch > > > The REST clients have sometimes failed to query applications through apps > REST API in RMWebService and it happened when iterating > attempts(RMWebServices#getApps --> AppInfo# --> > RMAppImpl#getRMAppMetrics) and meanwhile these attempts > changed(AttemptFailedTransition#transition --> > RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). > Application state changed within the lockup period of writeLock in RMAppImpl, > so that we can add readLock before iterating attempts to fix this problem. > Exception stack: > {noformat} > java.util.ConcurrentModificationException > at > java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) > at > java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.(AppInfo.java:199) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) > at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7489) ConcurrentModificationException in RMAppImpl#getRMAppMetrics
[ https://issues.apache.org/jira/browse/YARN-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7489: --- Description: The REST clients have sometimes failed to query applications through apps REST API in RMWebService and it happened when iterating attempts(RMWebServices#getApps --> AppInfo# --> RMAppImpl#getRMAppMetrics) and meanwhile these attempts changed(AttemptFailedTransition#transition --> RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). Application state changed within the lockup period of writeLock in RMAppImpl, so that we can add readLock before iterating attempts to fix this problem. Exception stack: {noformat} java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.(AppInfo.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) {noformat} was: The REST clients have sometimes failed to 
query applications through apps REST API in RMWebService and it happened when iterating attempts(RMWebServices#getApps --> AppInfo# --> RMAppImpl#getRMAppMetrics) and meanwhile these attempts changed(AttemptFailedTransition#transition --> RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). Application state changed within the lockup period of writeLock in RMAppImpl, so that we can add readLock before iterating attempts to fix this problem. Error logs: {noformat} java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.(AppInfo.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at
[jira] [Created] (YARN-7489) ConcurrentModificationException in RMAppImpl#getRMAppMetrics
Tao Yang created YARN-7489: -- Summary: ConcurrentModificationException in RMAppImpl#getRMAppMetrics Key: YARN-7489 URL: https://issues.apache.org/jira/browse/YARN-7489 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: Tao Yang Assignee: Tao Yang The REST clients have sometimes failed to query applications through the apps REST API in RMWebServices, and it happened when iterating attempts (RMWebServices#getApps --> AppInfo#<init> --> RMAppImpl#getRMAppMetrics) while these attempts were changed concurrently (AttemptFailedTransition#transition --> RMAppImpl#createAndStartNewAttempt --> RMAppImpl#createNewAttempt). Application state changes are made while holding the writeLock in RMAppImpl, so we can add the readLock before iterating attempts to fix this problem. Error logs: {noformat} java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) at java.util.LinkedHashMap$LinkedValueIterator.next(LinkedHashMap.java:747) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1487) at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppInfo.<init>(AppInfo.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:597) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:178) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
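A sketch of the proposed read-lock usage around the attempt iteration in RMAppImpl#getRMAppMetrics (simplified excerpt; the aggregation shown is illustrative, the point is only the locking structure):
{code}
this.readLock.lock();
try {
  for (RMAppAttempt attempt : attempts.values()) {
    // Iterate attempts under the read lock so a concurrent
    // createAndStartNewAttempt() (which holds the write lock) cannot
    // modify the attempts map mid-iteration.
    RMAppAttemptMetrics attemptMetrics = attempt.getRMAppAttemptMetrics();
    resourcePreempted = Resources.add(
        resourcePreempted, attemptMetrics.getResourcePreempted());
  }
} finally {
  this.readLock.unlock();
}
{code}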
[jira] [Updated] (YARN-7471) queueUsagePercentage is wrongly calculated for applications in zero-capacity queues
[ https://issues.apache.org/jira/browse/YARN-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7471: --- Attachment: YARN-7471.001.patch > queueUsagePercentage is wrongly calculated for applications in zero-capacity > queues > --- > > Key: YARN-7471 > URL: https://issues.apache.org/jira/browse/YARN-7471 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7471.001.patch > > > For applications in zero-capacity queues, queueUsagePercentage is wrongly > calculated to INFINITY with the expression (queueUsagePercentage = usedResource / > (totalPartitionRes * queueAbsMaxCapPerPartition)) when > queueAbsMaxCapPerPartition=0. > We can add a precondition (queueAbsMaxCapPerPartition != 0) before this > calculation to fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7471) queueUsagePercentage is wrongly calculated for applications in zero-capacity queues
Tao Yang created YARN-7471: -- Summary: queueUsagePercentage is wrongly calculated for applications in zero-capacity queues Key: YARN-7471 URL: https://issues.apache.org/jira/browse/YARN-7471 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha4 Reporter: Tao Yang Assignee: Tao Yang For applications in zero-capacity queues, queueUsagePercentage is wrongly calculated to INFINITY with the expression (queueUsagePercentage = usedResource / (totalPartitionRes * queueAbsMaxCapPerPartition)) when queueAbsMaxCapPerPartition=0. We can add a precondition (queueAbsMaxCapPerPartition != 0) before this calculation to fix this problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
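A sketch of the proposed precondition (variable names follow the description above; the surrounding calculation is simplified and not the actual patch):
{code}
float queueUsagePerc = 0.0f;
// Guard against zero-capacity queues: when queueAbsMaxCapPerPartition is 0,
// the division below would produce INFINITY, so skip it and report 0.
if (Math.abs(queueAbsMaxCapPerPartition) > 1e-8) {
  queueUsagePerc = usedResource
      / (totalPartitionRes * queueAbsMaxCapPerPartition);
}
{code}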
[jira] [Commented] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247143#comment-16247143 ] Tao Yang commented on YARN-7461: Thanks [~templedf] for your comments. I wrongly assumed that lhs fits in rhs and ignored the case you mentioned. I think the correct results with zero values for DominantResourceCalculator#ratio should be: <1,1,0> / <1,1,1> = 1; <1,1,1> / <1,1,0> = INFINITY; <1,1,0> / <1,1,0> = 1; Thoughts? > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245157#comment-16245157 ] Tao Yang edited comment on YARN-7461 at 11/9/17 3:33 AM: - Thanks [~templedf] for your comments. The code I added earlier to reproduce our problem is not necessary, thanks for reminding me. Replaced it with {{setupExtraResource()}} in the v2 patch. was (Author: tao yang): Thanks [~templedf] for your comments. The code I added earlier to reproduce our problem is not necessary, so I replaced it with {{setupExtraResource()}} in the v2 patch. > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7461: --- Attachment: YARN-7461.002.patch Thanks [~templedf] for your comments. The code I added earlier to reproduce our problem is not necessary, so I replaced it with {{setupExtraResource()}} in the v2 patch. > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch, YARN-7461.002.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
[ https://issues.apache.org/jira/browse/YARN-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7461: --- Attachment: YARN-7461.001.patch > DominantResourceCalculator#ratio calculation problem when right resource > contains zero value > > > Key: YARN-7461 > URL: https://issues.apache.org/jira/browse/YARN-7461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Tao Yang >Priority: Minor > Attachments: YARN-7461.001.patch > > > Currently DominantResourceCalculator#ratio may return a wrong result when the right > resource contains a zero value. For example, with three resource types, > leftResource=<5, 5, 0> and > rightResource=<10, 10, 0>, we expect the result of > DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is > currently NaN. > There should be a verification before the division to ensure that the > dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7461) DominantResourceCalculator#ratio calculation problem when right resource contains zero value
Tao Yang created YARN-7461: -- Summary: DominantResourceCalculator#ratio calculation problem when right resource contains zero value Key: YARN-7461 URL: https://issues.apache.org/jira/browse/YARN-7461 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha4 Reporter: Tao Yang Priority: Minor Currently DominantResourceCalculator#ratio may return a wrong result when the right resource contains a zero value. For example, with three resource types, leftResource=<5, 5, 0> and rightResource=<10, 10, 0>, we expect the result of DominantResourceCalculator#ratio(leftResource, rightResource) to be 0.5 but it is currently NaN. There should be a verification before the division to ensure that the dividend is not zero. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6737) Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146586#comment-16146586 ] Tao Yang edited comment on YARN-6737 at 8/30/17 4:33 AM: - Upload v1 patch for trunk. Sorry to be late for this update. I have scanned all the usages of AbstractYarnScheduler#getApplicationAttempt and CapacityScheduler#getApplicationAttempt and found one potential problem in QueuePriorityContainerCandidateSelector#preChecksForMovingReservedContainerToNode. {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (!app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} NPE should happen here if app is no longer exist, I think we can correct it through adding null check for app like this (the outer caller will skip this invalid reservedContainer): {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (app == null || !app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} [~sunilg] Please help to review this patch. Thanks! was (Author: tao yang): Upload v1 patch for trunk. Sorry to be late for this update. I have scanned all the usages of AbstractYarnScheduler#getApplicationAttempt and CapacityScheduler#getApplicationAttempt and found one potential problem in QueuePriorityContainerCandidateSelector#preChecksForMovingReservedContainerToNode. {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (!app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} NPE should happen here if app is no longer exist, I think we can correct it through adding null check for app like this (the outer caller will skip this invalid reservedContainer): {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (app == null || !app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} [~sunilg] Please help to review this patch. Thanks! > Rename getApplicationAttempt to getCurrentAttempt in > AbstractYarnScheduler/CapacityScheduler > > > Key: YARN-6737 > URL: https://issues.apache.org/jira/browse/YARN-6737 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Priority: Minor > Attachments: YARN-6737.001.patch > > > As discussed in YARN-6714 > (https://issues.apache.org/jira/browse/YARN-6714?focusedCommentId=16052158=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16052158) > AbstractYarnScheduler#getApplicationAttempt is inconsistent to its name, it > discarded application_attempt_id and always return the latest attempt. We > should: 1) Rename it to getCurrentAttempt, 2) Change parameter from attemptId > to applicationId. 3) Took a scan of all usages to see if any similar issue > could happen. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6737) Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6737: --- Attachment: YARN-6737.001.patch Uploading v1 patch for trunk. Sorry to be late with this update. I have scanned all the usages of AbstractYarnScheduler#getApplicationAttempt and CapacityScheduler#getApplicationAttempt and found one potential problem in QueuePriorityContainerCandidateSelector#preChecksForMovingReservedContainerToNode. {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (!app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} An NPE could happen here if the app no longer exists. I think we can correct it by adding a null check for app like this (the outer caller will skip this invalid reservedContainer): {code} FiCaSchedulerApp app = preemptionContext.getScheduler().getCurrentApplicationAttempt( reservedContainer.getApplicationAttemptId()); if (app == null || !app.getAppSchedulingInfo().canDelayTo( reservedContainer.getAllocatedSchedulerKey(), ResourceRequest.ANY)) { // This is a hard locality request return false; } {code} [~sunilg] Please help to review this patch. Thanks! > Rename getApplicationAttempt to getCurrentAttempt in > AbstractYarnScheduler/CapacityScheduler > > > Key: YARN-6737 > URL: https://issues.apache.org/jira/browse/YARN-6737 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Priority: Minor > Attachments: YARN-6737.001.patch > > > As discussed in YARN-6714 > (https://issues.apache.org/jira/browse/YARN-6714?focusedCommentId=16052158=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16052158) > AbstractYarnScheduler#getApplicationAttempt is inconsistent with its name: it > discards the application_attempt_id and always returns the latest attempt. We > should: 1) Rename it to getCurrentAttempt, 2) Change the parameter from attemptId > to applicationId. 3) Take a scan of all usages to see if any similar issue > could happen. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146470#comment-16146470 ] Tao Yang commented on YARN-7037: Thanks [~djp] for review and commit ! > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Fix For: 2.9.0, 3.0.0-beta1, 2.8.3 > > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141364#comment-16141364 ] Tao Yang commented on YARN-7037: Thanks [~djp] for looking into the issue. I chose to add a new method since this optimization cannot cover all use cases; zero-copy is only suitable for local reads. LogToolUtils#outputContainerLog is used for both the local log, which can be optimized through FileInputStream, and the aggregated log, which can't because it is transferred by DataInputStream from a remote node. > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With the zero-copy approach, > the data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to the pipeline (disk --> read buffer --> socket buffer). > In my local test, the time cost of copying a 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
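A minimal sketch of the zero-copy path for the local-log case, using java.nio FileChannel#transferTo (the logFile, offset, length and outputStream variables are placeholders for the NMWebServices plumbing, which is omitted here):
{code}
try (FileInputStream fis = new FileInputStream(logFile)) {
  FileChannel in = fis.getChannel();
  WritableByteChannel out = Channels.newChannel(outputStream);
  long position = offset;
  long remaining = length;
  while (remaining > 0) {
    // transferTo lets the kernel move bytes from the page cache toward the
    // socket without copying them through a user-space NM buffer.
    long transferred = in.transferTo(position, remaining, out);
    if (transferred <= 0) {
      break;  // reached EOF or the target cannot accept more data
    }
    position += transferred;
    remaining -= transferred;
  }
}
{code}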
[jira] [Updated] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7037: --- Attachment: YARN-7037.001.patch YARN-7037.branch-2.8.001.patch Upload v1 patch for trunk and update v1 patch for branch-2.8(There is no need to close i/o channel). > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7037: --- Attachment: (was: YARN-7037.branch-2.8.001.patch) > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.001.patch, YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6257: --- Attachment: YARN-6257.002.patch Uploading v2 patch for review. The RM REST document has been updated. > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains duplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch, YARN-6257.002.patch > > > In the response string of the CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo has the duplicate key 'entry' as a > JSON object: > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of the operationsInfo field in the > CapacitySchedulerHealthInfo class should be converted from Map to List. > After converting to List, the operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
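A sketch of what the List-based field in CapacitySchedulerHealthInfo could look like (the field and element type names here are illustrative assumptions; the actual patch may differ):
{code}
// A List serializes as a JSON array of objects, so each operation becomes
// one element of "operationInfos" instead of a repeated "entry" key.
@XmlElement(name = "operationInfos")
private List<OperationInformation> operationInfos = new ArrayList<>();
{code}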
[jira] [Commented] (YARN-6259) Support pagination and optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130303#comment-16130303 ] Tao Yang commented on YARN-6259: Thanks [~djp] for your suggestions. It makes sense to me. I have created YARN-7037 to handle performance improvement and the patch of this issue will be updated later. I noticed that there are many differences between 2.8 and 2.9/trunk, 2.9/trunk supports getting head or tail part of log file. It's close to our requirements but still not enough to support pagination. > Support pagination and optimize data transfer with zero-copy approach for > containerlogs REST API in NMWebServices > - > > Key: YARN-6259 > URL: https://issues.apache.org/jira/browse/YARN-6259 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6259.001.patch > > > Currently containerlogs REST API in NMWebServices will read and send the > entire content of container logs. Most of container logs are large and it's > useful to support pagination. > * Add pagesize and pageindex parameters for containerlogs REST API > {code} > URL: http:///ws/v1/node/containerlogs// > QueryParams: > pagesize - max bytes of one page , default 1MB > pageindex - index of required page, default 0, can be nagative(set -1 will > get the last page content) > {code} > * Add containerlogs-info REST API since sometimes we need to know the > totalSize/pageSize/pageCount info of log > {code} > URL: > http:///ws/v1/node/containerlogs-info// > QueryParams: > pagesize - max bytes of one page , default 1MB > Response example: > {"logInfo":{"totalSize":2497280,"pageSize":1048576,"pageCount":3}} > {code} > Moreover, the data transfer pipeline (disk --> read buffer --> NM buffer --> > socket buffer) can be optimized to pipeline(disk --> read buffer --> socket > buffer) with zero-copy approach. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
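For illustration, the paging arithmetic implied by the proposed pagesize/pageindex parameters and the logInfo example above, as a standalone sketch rather than NMWebServices code. The main method reproduces the example response: totalSize=2497280 and pageSize=1048576 give pageCount=3.
{code}
/**
 * Paging arithmetic sketch for the proposed containerlogs REST parameters.
 */
public class LogPageCalculator {
  /** Returns {offset, length} of the requested page within the log file. */
  public static long[] pageRange(long totalSize, long pageSize, int pageIndex) {
    if (pageSize <= 0) {
      throw new IllegalArgumentException("pagesize must be positive");
    }
    long pageCount = (totalSize + pageSize - 1) / pageSize;  // ceiling division
    // A negative index counts from the end: -1 means the last page.
    long index = pageIndex >= 0 ? pageIndex : pageCount + pageIndex;
    if (index < 0 || index >= pageCount) {
      throw new IllegalArgumentException("pageindex out of range: " + pageIndex);
    }
    long offset = index * pageSize;
    long length = Math.min(pageSize, totalSize - offset);
    return new long[] { offset, length };
  }

  public static void main(String[] args) {
    // Matches the example: totalSize=2497280, pageSize=1048576 -> pageCount=3.
    long[] lastPage = pageRange(2497280L, 1048576L, -1);
    System.out.println("offset=" + lastPage[0] + ", length=" + lastPage[1]);
  }
}
{code}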
[jira] [Updated] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-7037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7037: --- Attachment: YARN-7037.branch-2.8.001.patch Upload v1 patch for review. > Optimize data transfer with zero-copy approach for containerlogs REST API in > NMWebServices > -- > > Key: YARN-7037 > URL: https://issues.apache.org/jira/browse/YARN-7037 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7037.branch-2.8.001.patch > > > Split this improvement from YARN-6259. > It's useful to read container logs more efficiently. With zero-copy approach, > data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) > can be optimized to pipeline(disk --> read buffer --> socket buffer) . > In my local test, time cost of copying 256MB file with zero-copy can be > reduced from 12 seconds to 2.5 seconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7037) Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
Tao Yang created YARN-7037: -- Summary: Optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices Key: YARN-7037 URL: https://issues.apache.org/jira/browse/YARN-7037 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.8.3 Reporter: Tao Yang Assignee: Tao Yang Split this improvement from YARN-6259. It's useful to read container logs more efficiently. With the zero-copy approach, the data transfer pipeline (disk --> read buffer --> NM buffer --> socket buffer) can be optimized to (disk --> read buffer --> socket buffer). In my local test, the time to copy a 256MB file was reduced from 12 seconds to 2.5 seconds with zero-copy. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129955#comment-16129955 ] Tao Yang commented on YARN-6257: Thanks [~sunilg] and [~leftnoteasy]. it makes sense to me. I will update the document and upload a new patch for review. > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains deplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch > > > In response string of CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo have duplicate key 'entry' as a > JSON object : > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of operationsInfo field in > CapacitySchedulerHealthInfo class should be converted from Map to List. > After convert to List, The operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5683) Support specifying storage type for per-application local dirs
[ https://issues.apache.org/jira/browse/YARN-5683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-5683: --- Description: h3. Introduction * Some applications of various frameworks (Flink, Spark, MapReduce etc.) that use local storage (checkpoint, shuffle etc.) might require high IO performance. It's useful to allocate local directories on high performance storage media for these applications on heterogeneous clusters. * YARN does not distinguish different storage types and hence applications cannot selectively use storage media with different performance characteristics. Adding awareness of storage media can allow YARN to make better decisions about the placement of local directories. h3. Approach * NodeManager will distinguish storage types for local directories. ** The yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs configurations should allow the cluster administrator to optionally specify the storage type for each local directory. Example: [SSD]/disk1/nm-local-dir,/disk2/nm-local-dir,/disk3/nm-local-dir (equivalent to [SSD]/disk1/nm-local-dir,[DISK]/disk2/nm-local-dir,[DISK]/disk3/nm-local-dir) ** StorageType defines the DISK/SSD storage types and takes DISK as the default storage type. ** StorageLocation separates storage type and directory path, and is used by LocalDirAllocator to be aware of the storage types of local dirs; the default storage type is DISK. ** The getLocalPathForWrite method of LocalDirAllocator will prefer the local directory of the specified storage type, and will fall back to ignoring the storage type if the requirement cannot be satisfied. ** Support for container-related local/log directories by ContainerLaunch. All application frameworks can set the environment variables (LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE) to specify the desired storage type of local/log directories, and can choose not to launch the container on fallback through these environment variables (ENSURE_LOCAL_STORAGE_TYPE and ENSURE_LOG_STORAGE_TYPE). * Allow specifying the storage type for various frameworks (take MapReduce as an example) ** New configurations should allow the application administrator to optionally specify the storage type of local/log directories and the fallback strategy (MapReduce configurations: mapreduce.job.local-storage-type, mapreduce.job.log-storage-type, mapreduce.job.ensure-local-storage-type and mapreduce.job.ensure-log-storage-type). ** Support for container work directories. Set the environment variables, including LOCAL_STORAGE_TYPE and LOG_STORAGE_TYPE, according to the configurations above for ContainerLaunchContext and ApplicationSubmissionContext. (MapReduce should update YARNRunner and TaskAttemptImpl) ** Add a storage type prefix to the request path to support other local directories of frameworks (such as shuffle directories for MapReduce). (MapReduce should update YarnOutputFiles, MROutputFiles and YarnChild to support output/work directories) ** Flow diagram for MapReduce framework !flow_diagram_for_MapReduce-2.png! h3. Further Discussion * Scheduling : The requirement of storage type for local/log directories may not be satisfied for a part of the nodes on heterogeneous clusters. To achieve a global optimum, the scheduler should be aware of and manage disk resources. ** Approach-1: Based on node attributes (YARN-3409), the scheduler can allocate containers that require SSD on nodes with attribute:ssd=true. 
** Approach-2: Based on the extended resource model (YARN-3926), it's easy to support scheduling by extending resource models like vdisk and vssd with this feature, but hard to measure for applications and to isolate for non-CFQ based disks. * The fallback strategy still needs to be considered. Certain applications might not work well when the storage type requirement is not satisfied. When no disk of the desired storage type is available, should the container launch fail, or should the AM handle it? We have implemented a fallback strategy that fails the container launch when no disk of the desired storage type is available. Are there better approaches? This feature has been used for half a year to meet the needs of some applications on Alibaba search clusters. Please feel free to give your suggestions and opinions. was: h3. Introduction * Some applications of various frameworks (Flink, Spark and MapReduce etc) using local storage (checkpoint, shuffle etc) might require high IO performance. It's useful to allocate local directories to high performance storage media for these applications on heterogeneous clusters. * YARN does not distinguish different storage types and hence applications cannot selectively use storage media with different performance characteristics. Adding awareness of storage media can allow YARN to make better decisions about the placement of local directories. h3. Approach * NodeManager will distinguish storage types for local directories. **
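For illustration, a sketch of parsing the optional storage-type prefix (e.g. [SSD]/disk1/nm-local-dir) from a configured local-dir entry, as described in the Approach section above. The StorageType enum and parse method are simplified stand-ins, not the YARN-5683 patch.
{code}
/**
 * Sketch of parsing an optional storage-type prefix from a configured
 * local-dir entry. Entries without a prefix fall back to DISK.
 */
public class LocalDirEntry {
  enum StorageType { DISK, SSD }

  final StorageType storageType;
  final String path;

  private LocalDirEntry(StorageType storageType, String path) {
    this.storageType = storageType;
    this.path = path;
  }

  static LocalDirEntry parse(String configured) {
    String trimmed = configured.trim();
    if (trimmed.startsWith("[")) {
      int end = trimmed.indexOf(']');
      if (end > 0) {
        StorageType type =
            StorageType.valueOf(trimmed.substring(1, end).toUpperCase());
        return new LocalDirEntry(type, trimmed.substring(end + 1));
      }
    }
    // No prefix: use the default storage type.
    return new LocalDirEntry(StorageType.DISK, trimmed);
  }

  public static void main(String[] args) {
    for (String dir : "[SSD]/disk1/nm-local-dir,/disk2/nm-local-dir".split(",")) {
      LocalDirEntry e = parse(dir);
      System.out.println(e.storageType + " -> " + e.path);
    }
  }
}
{code}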
[jira] [Updated] (YARN-7004) Add configs cache to optimize refreshQueues performance for large scale of queues
[ https://issues.apache.org/jira/browse/YARN-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7004: --- Summary: Add configs cache to optimize refreshQueues performance for large scale of queues (was: Add configs cache to optimize refreshQueues performance for large scale queues) > Add configs cache to optimize refreshQueues performance for large scale of > queues > - > > Key: YARN-7004 > URL: https://issues.apache.org/jira/browse/YARN-7004 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7004.001.patch > > > We have requirements for large scale queues in our production environment to > serve for many projects. So we did some tests for more than 5000 queues and > found that refreshQueues process took more than 1 minute. The refreshQueues > process costs most of time on iterating over all configurations to get > accessible-node-labels and ordering-policy configs for every queue. > Loading queue configs from cache should be beneficial to reduce time costs > (optimized from 1 minutes to 3 seconds for 5000 queues in our test) when > initializing/reinitializing queues. So I propose to load queue configs into > cache in CapacityScheduler#initializeQueues and > CapacityScheduler#reinitializeQueues. If cache has not be loaded on other > scenes, such as in test cases, it still can get queue configs by iterating > over all configurations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7005) Skip unnecessary sorting and iterating process for child queues without pending resource to optimize schedule performance
Tao Yang created YARN-7005: -- Summary: Skip unnecessary sorting and iterating process for child queues without pending resource to optimize schedule performance Key: YARN-7005 URL: https://issues.apache.org/jira/browse/YARN-7005 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.0.0-alpha4, 2.9.0 Reporter: Tao Yang Currently, even if there is only one pending app in a queue, the scheduling process goes through all queues anyway and spends most of its time sorting and iterating child queues in ParentQueue#assignContainersToChildQueues. IIUIC, queues that have no pending resource can be skipped in the sorting and iterating process to reduce the time cost, especially for a cluster with many queues. Please feel free to correct me if I have missed something. Thanks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
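For illustration, a sketch of the proposed optimization, where only child queues with pending resource are sorted and iterated. ChildQueue is a hypothetical stand-in for the real CSQueue API, so this shows the idea rather than the eventual patch.
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Sketch: queues with no pending resource cannot produce an allocation,
 * so they are filtered out before the comparatively expensive sort.
 */
public class PendingAwareAssignment {
  interface ChildQueue {
    long getPendingResource();   // pending resource for this partition
    float getUsedCapacity();     // sort key used by the parent queue
    boolean assignContainers();  // returns true if something was allocated
  }

  static boolean assignToChildQueues(List<ChildQueue> children) {
    List<ChildQueue> candidates = new ArrayList<>();
    for (ChildQueue q : children) {
      if (q.getPendingResource() > 0) {
        candidates.add(q);
      }
    }
    candidates.sort(Comparator.comparingDouble(q -> q.getUsedCapacity()));
    for (ChildQueue q : candidates) {
      if (q.assignContainers()) {
        return true;
      }
    }
    return false;
  }
}
{code}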
[jira] [Updated] (YARN-7004) Add configs cache to optimize refreshQueues performance for large scale queues
[ https://issues.apache.org/jira/browse/YARN-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7004: --- Attachment: YARN-7004.001.patch Uploaded v1 patch for review. > Add configs cache to optimize refreshQueues performance for large scale queues > -- > > Key: YARN-7004 > URL: https://issues.apache.org/jira/browse/YARN-7004 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7004.001.patch > > > We have requirements for large scale queues in our production environment to > serve for many projects. So we did some tests for more than 5000 queues and > found that refreshQueues process took more than 1 minute. The refreshQueues > process costs most of time on iterating over all configurations to get > accessible-node-labels and ordering-policy configs for every queue. > Loading queue configs from cache should be beneficial to reduce time costs > (optimized from 1 minutes to 3 seconds for 5000 queues in our test) when > initializing/reinitializing queues. So I propose to load queue configs into > cache in CapacityScheduler#initializeQueues and > CapacityScheduler#reinitializeQueues. If cache has not be loaded on other > scenes, such as in test cases, it still can get queue configs by iterating > over all configurations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7004) Add configs cache to optimize refreshQueues performance for large scale queues
Tao Yang created YARN-7004: -- Summary: Add configs cache to optimize refreshQueues performance for large scale queues Key: YARN-7004 URL: https://issues.apache.org/jira/browse/YARN-7004 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 3.0.0-alpha4, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang We have a requirement for a large number of queues in our production environment to serve many projects. We ran tests with more than 5000 queues and found that the refreshQueues process took more than 1 minute. The refreshQueues process spends most of its time iterating over all configurations to get the accessible-node-labels and ordering-policy configs for every queue. Loading queue configs from a cache should reduce the time cost (from 1 minute to 3 seconds for 5000 queues in our test) when initializing/reinitializing queues. So I propose to load queue configs into a cache in CapacityScheduler#initializeQueues and CapacityScheduler#reinitializeQueues. If the cache has not been loaded in other scenarios, such as test cases, queue configs can still be obtained by iterating over all configurations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
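For illustration, a sketch of the caching idea in YARN-7004 above: one pass over the configuration builds a snapshot that later per-queue lookups hit in O(1), instead of re-iterating all entries for every queue. Class and method names are illustrative, not the attached patch.
{code}
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a queue-config cache: snapshot all scheduler-related entries once
 * during initialize/reinitialize, then serve per-queue lookups from the map.
 */
public class QueueConfigCache {
  private static final String PREFIX = "yarn.scheduler.capacity.";

  private final Map<String, String> snapshot = new HashMap<>();

  /** Single pass over all configuration entries. */
  public void load(Map<String, String> allConfigs) {
    snapshot.clear();
    for (Map.Entry<String, String> e : allConfigs.entrySet()) {
      if (e.getKey().startsWith(PREFIX)) {
        snapshot.put(e.getKey(), e.getValue());
      }
    }
  }

  /** O(1) lookup, e.g. get("root.a", "accessible-node-labels", null). */
  public String get(String queuePath, String property, String defaultValue) {
    String value = snapshot.get(PREFIX + queuePath + "." + property);
    return value != null ? value : defaultValue;
  }

  /** Callers can fall back to a full scan when the cache was never loaded. */
  public boolean isLoaded() {
    return !snapshot.isEmpty();
  }
}
{code}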
[jira] [Updated] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
[ https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7003: --- Attachment: YARN-7003.001.patch > DRAINING state of queues can't be recovered after RM restart > > > Key: YARN-7003 > URL: https://issues.apache.org/jira/browse/YARN-7003 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-7003.001.patch > > > DRAINING state is a temporary state in RM memory, when queue state is set to > be STOPPED but there are still some pending or active apps in it, the queue > state will be changed to DRAINING instead of STOPPED after refreshing queues. > We've encountered the problem that the state of this queue will aways be > STOPPED after RM restarted, so that it can be removed at any time and leave > some apps in a non-existing queue. > To fix this problem, we could recover DRAINING state in the recovery process > of pending/active apps. I will upload a patch with test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
[ https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7003: --- Affects Version/s: (was: 3.0.0-alpha3) 3.0.0-alpha4 > DRAINING state of queues can't be recovered after RM restart > > > Key: YARN-7003 > URL: https://issues.apache.org/jira/browse/YARN-7003 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang > > DRAINING state is a temporary state in RM memory, when queue state is set to > be STOPPED but there are still some pending or active apps in it, the queue > state will be changed to DRAINING instead of STOPPED after refreshing queues. > We've encountered the problem that the state of this queue will aways be > STOPPED after RM restarted, so that it can be removed at any time and leave > some apps in a non-existing queue. > To fix this problem, we could recover DRAINING state in the recovery process > of pending/active apps. I will upload a patch with test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
[ https://issues.apache.org/jira/browse/YARN-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7003: --- Affects Version/s: 2.9.0 > DRAINING state of queues can't be recovered after RM restart > > > Key: YARN-7003 > URL: https://issues.apache.org/jira/browse/YARN-7003 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Tao Yang > > DRAINING state is a temporary state in RM memory, when queue state is set to > be STOPPED but there are still some pending or active apps in it, the queue > state will be changed to DRAINING instead of STOPPED after refreshing queues. > We've encountered the problem that the state of this queue will aways be > STOPPED after RM restarted, so that it can be removed at any time and leave > some apps in a non-existing queue. > To fix this problem, we could recover DRAINING state in the recovery process > of pending/active apps. I will upload a patch with test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7003) DRAINING state of queues can't be recovered after RM restart
Tao Yang created YARN-7003: -- Summary: DRAINING state of queues can't be recovered after RM restart Key: YARN-7003 URL: https://issues.apache.org/jira/browse/YARN-7003 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha3 Reporter: Tao Yang DRAINING is a temporary state in RM memory: when a queue's state is set to STOPPED but there are still some pending or active apps in it, the queue state will be changed to DRAINING instead of STOPPED after refreshing queues. We've encountered the problem that the state of such a queue will always be STOPPED after the RM restarts, so the queue can be removed at any time, leaving some apps in a non-existent queue. To fix this problem, we could recover the DRAINING state in the recovery process of pending/active apps. I will upload a patch with a test case later for review. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
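For illustration, a sketch of the proposed recovery fix, where a queue whose configured state is STOPPED is moved back to DRAINING as soon as a pending/active app is recovered into it. The types here are simplified stand-ins, not the CapacityScheduler code.
{code}
/**
 * Sketch: during app recovery, a STOPPED queue that still holds apps is
 * put back into DRAINING, matching its state before the RM restart.
 */
public class DrainingStateRecovery {
  enum QueueState { RUNNING, DRAINING, STOPPED }

  static class Queue {
    QueueState state;
    int numApplications;

    Queue(QueueState state) {
      this.state = state;
    }

    /** Called for every application recovered into this queue. */
    void recoverApplication() {
      numApplications++;
      if (state == QueueState.STOPPED) {
        state = QueueState.DRAINING;
      }
    }
  }

  public static void main(String[] args) {
    Queue q = new Queue(QueueState.STOPPED);
    q.recoverApplication();
    System.out.println(q.state);  // DRAINING
  }
}
{code}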
[jira] [Resolved] (YARN-6044) Resource bar of Capacity Scheduler UI does not show correctly
[ https://issues.apache.org/jira/browse/YARN-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang resolved YARN-6044. Resolution: Duplicate > Resource bar of Capacity Scheduler UI does not show correctly > - > > Key: YARN-6044 > URL: https://issues.apache.org/jira/browse/YARN-6044 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.0 >Reporter: Tao Yang >Priority: Minor > > Test Environment: > 1. NodeLabel > yarn rmadmin -addToClusterNodeLabels "label1(exclusive=false)" > 2. capacity-scheduler.xml > yarn.scheduler.capacity.root.queues=a,b > yarn.scheduler.capacity.root.a.capacity=60 > yarn.scheduler.capacity.root.b.capacity=40 > yarn.scheduler.capacity.root.a.accessible-node-labels=label1 > yarn.scheduler.capacity.root.accessible-node-labels.label1.capacity=100 > yarn.scheduler.capacity.root.a.accessible-node-labels.label1.capacity=100 > In this test case, for queue(root.b) in partition(label1), the resource > bar(representing absolute-max-capacity) should be 100%(default). The scheduler > UI shows correctly after RM started, but when I started an app in > queue(root.b) and partition(label1), the resource bar of this queue was > changed from 100% to 0%. > For the correct queue(root.a), the queueCapacities of partition(label1) was > initialized in the ParentQueue/LeafQueue constructor and > max-capacity/absolute-max-capacity were set with the correct value, because > yarn.scheduler.capacity.root.a.accessible-node-labels is defined in > capacity-scheduler.xml > For the incorrect queue(root.b), the queueCapacities of partition(label1) didn't > exist at first; the max-capacity and absolute-max-capacity were set with the > default value (100%) in PartitionQueueCapacitiesInfo so that the Scheduler UI > could show correctly. When this queue was allocating resource for > partition(label1), the queueCapacities of partition(label1) was created and > only used-capacity and absolute-used-capacity were set in > AbstractCSQueue#allocateResource. max-capacity and absolute-max-capacity have > to use the float default value 0 defined in QueueCapacities$Capacities. > Should max-capacity and absolute-max-capacity have a default > value (100%) in the Capacities constructor to avoid losing the default value when a > caller does not provide one? > Please feel free to give your suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
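For illustration, the default-value suggestion at the end of the description above, sketched as a simplified Capacities-like class where max-capacity and absolute-max-capacity default to 100% instead of the float default 0. Field names are illustrative only.
{code}
/**
 * Sketch: default max capacities to 100% so a partition that is first
 * touched by allocation does not report 0% in the UI/REST output.
 */
class CapacitiesSketch {
  float usedCapacity = 0f;
  float absoluteUsedCapacity = 0f;
  // Default to 100% instead of the implicit float default 0, so readers see
  // a sensible maximum even if no explicit value was set for the partition.
  float maximumCapacity = 1.0f;
  float absoluteMaximumCapacity = 1.0f;
}
{code}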
[jira] [Commented] (YARN-6044) Resource bar of Capacity Scheduler UI does not show correctly
[ https://issues.apache.org/jira/browse/YARN-6044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124767#comment-16124767 ] Tao Yang commented on YARN-6044: Thanks [~djp] and [~sunilg] for your reply. The solution makes sense to me. > Resource bar of Capacity Scheduler UI does not show correctly > - > > Key: YARN-6044 > URL: https://issues.apache.org/jira/browse/YARN-6044 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.0 >Reporter: Tao Yang >Priority: Minor > > Test Environment: > 1. NodeLable > yarn rmadmin -addToClusterNodeLabels "label1(exclusive=false)" > 2. capacity-scheduler.xml > yarn.scheduler.capacity.root.queues=a,b > yarn.scheduler.capacity.root.a.capacity=60 > yarn.scheduler.capacity.root.b.capacity=40 > yarn.scheduler.capacity.root.a.accessible-node-labels=label1 > yarn.scheduler.capacity.root.accessible-node-labels.label1.capacity=100 > yarn.scheduler.capacity.root.a.accessible-node-labels.label1.capacity=100 > In this test case, for queue(root.b) in partition(label1), the resource > bar(represents absolute-max-capacity) should be 100%(default). The scheduler > UI shows correctly after RM started, but when I started an app in > queue(root.b) and partition(label1) , the resource bar of this queue is > changed from 100% to 0%. > For corrent queue(root.a), the queueCapacities of partition(label1) was > inited in ParentQueue/LeafQueue constructor and > max-capacity/absolute-max-capacity were setted with correct value, due to > yarn.scheduler.capacity.root.a.accessible-node-labels is defined in > capacity-scheduler.xml > For incorrent queue(root.b), the queueCapacities of partition(label1) didn't > exist at first, the max-capacity and absolute-max-capacity were setted with > default value(100%) in PartitionQueueCapacitiesInfo so that Scheduler UI > could show correctly. When this queue was allocating resource for > partition(label1), the queueCapacities of partition(label1) was created and > only used-capacity and absolute-used-capacity were setted in > AbstractCSQueue#allocateResource. max-capacity and absolute-max-capacity have > to use float default value 0 which are defined in QueueCapacities$Capacities. > Whether max-capacity and absolute-max-capacity should have default > value(100%) in Capacities constructor to avoid losing default value if > somewhere called not given? > Please feel free to give your suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Attachment: YARN-6629.002.patch Uploaded a new patch with test case. > NPE occurred when container allocation proposal is applied but its resource > requests are removed before > --- > > Key: YARN-6629 > URL: https://issues.apache.org/jira/browse/YARN-6629 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6629.001.patch, YARN-6629.002.patch > > > I wrote a test case to reproduce another problem for branch-2 and found new > NPE error, log: > {code} > FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) > at > org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) > at > org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) > at org.mockito.internal.MockHandler.handle(MockHandler.java:97) > at > org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > {code} > Reproduce this error in chronological order: > 1. AM started and requested 1 container with schedulerRequestKey#1 : > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests > Added schedulerRequestKey#1 into schedulerKeyToPlacementSets > 2. Scheduler allocatd 1 container for this request and accepted the proposal > 3. 
AM removed this request > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests --> > AppSchedulingInfo#addToPlacementSets --> > AppSchedulingInfo#updatePendingResources > Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) > 4. Scheduler applied this proposal > CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> > AppSchedulingInfo#allocate > Throw NPE when called > schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, > type, node); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
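For illustration, a sketch of the defensive check implied by the stack trace above. When a commit is applied after the AM has already removed the resource request, the placement set for that scheduler key may be gone, so the proposal should be rejected instead of dereferencing null. The types are simplified stand-ins for AppSchedulingInfo internals, not the actual patch.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch: reject an outdated proposal when its placement set no longer
 * exists, instead of throwing NullPointerException.
 */
public class AllocateGuardSketch {
  static class PlacementSet {
    void allocate(String schedulerKey) {
      // bookkeeping for the allocated container would happen here
    }
  }

  private final Map<String, PlacementSet> schedulerKeyToPlacementSets =
      new ConcurrentHashMap<>();

  /** Returns false if the request behind this key no longer exists. */
  boolean allocate(String schedulerKey) {
    PlacementSet ps = schedulerKeyToPlacementSets.get(schedulerKey);
    if (ps == null) {
      // Request was removed between the accept and apply phases; skip it.
      return false;
    }
    ps.allocate(schedulerKey);
    return true;
  }
}
{code}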
[jira] [Commented] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124569#comment-16124569 ] Tao Yang commented on YARN-6629: Sorry for the late reply. Thanks [~sunilg] for reviewing this issue. Yes, It's happening in trunk as well. I will write a test case and update the patch later. > NPE occurred when container allocation proposal is applied but its resource > requests are removed before > --- > > Key: YARN-6629 > URL: https://issues.apache.org/jira/browse/YARN-6629 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6629.001.patch > > > I wrote a test case to reproduce another problem for branch-2 and found new > NPE error, log: > {code} > FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) > at > org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) > at > org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) > at org.mockito.internal.MockHandler.handle(MockHandler.java:97) > at > org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > {code} > Reproduce this error in chronological order: > 1. AM started and requested 1 container with schedulerRequestKey#1 : > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests > Added schedulerRequestKey#1 into schedulerKeyToPlacementSets > 2. Scheduler allocatd 1 container for this request and accepted the proposal > 3. 
AM removed this request > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests --> > AppSchedulingInfo#addToPlacementSets --> > AppSchedulingInfo#updatePendingResources > Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) > 4. Scheduler applied this proposal > CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> > AppSchedulingInfo#allocate > Throw NPE when called > schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, > type, node); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124563#comment-16124563 ] Tao Yang commented on YARN-6257: [~leftnoteasy], thanks for the reply. Yes, duplicated keys in JSON object is completely unconsumable by clients. Take the parse results with different json-libs for example, we will get JSONException(Duplicated Key ...) if using org.json, and will get the last entry(lose other entries) if use org.codehaus.jettison > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains deplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch > > > In response string of CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo have duplicate key 'entry' as a > JSON object : > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of operationsInfo field in > CapacitySchedulerHealthInfo class should be converted from Map to List. > After convert to List, The operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6257) CapacityScheduler REST API produces incorrect JSON - JSON object operationsInfo contains duplicate key
[ https://issues.apache.org/jira/browse/YARN-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122755#comment-16122755 ] Tao Yang commented on YARN-6257: This problem was imported by YARN-3293 (2.8.0+). The operationsInfo can't be correctly used before as it's not follow JSON format. [~vvasudev], Please help to review this issue. > CapacityScheduler REST API produces incorrect JSON - JSON object > operationsInfo contains deplicate key > -- > > Key: YARN-6257 > URL: https://issues.apache.org/jira/browse/YARN-6257 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-6257.001.patch > > > In response string of CapacityScheduler REST API, > scheduler/schedulerInfo/health/operationsInfo have duplicate key 'entry' as a > JSON object : > {code} > "operationsInfo":{ > > "entry":{"key":"last-preemption","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-reservation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-allocation","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}}, > > "entry":{"key":"last-release","value":{"nodeId":"N/A","containerId":"N/A","queue":"N/A"}} > } > {code} > To solve this problem, I suppose the type of operationsInfo field in > CapacitySchedulerHealthInfo class should be converted from Map to List. > After convert to List, The operationsInfo string will be: > {code} > "operationInfos":[ > > {"operation":"last-allocation","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-release","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-preemption","nodeId":"N/A","containerId":"N/A","queue":"N/A"}, > > {"operation":"last-reservation","nodeId":"N/A","containerId":"N/A","queue":"N/A"} > ] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.branch-2.005.patch Attached branch-2 patch for cleanly applying. Thanks [~sunilg] and [~leftnoteasy] for commits and reviews. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch, YARN-6678.005.patch, > YARN-6678.branch-2.005.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
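For illustration, a sketch of the extra check described above for re-reservation proposals, confirming that the container currently reserved on the node is the same container being re-reserved. Node and Container are simplified stand-ins for FiCaSchedulerNode and RMContainer, not the patch itself.
{code}
/**
 * Sketch: a re-reservation proposal is only accepted if the node is still
 * reserved by exactly the same container; a first-time reservation is only
 * accepted if the node is not reserved at all.
 */
public class ReReservationCheckSketch {
  static class Container {
    final String id;
    Container(String id) { this.id = id; }
  }

  static class Node {
    Container reservedContainer;
  }

  static boolean acceptReserveProposal(Node node, Container proposed,
      boolean reReservation) {
    Container current = node.reservedContainer;
    if (reReservation) {
      // The node may have been un-reserved and re-used by another app
      // between generating and committing the proposal.
      return current != null && current.id.equals(proposed.id);
    }
    return current == null;
  }
}
{code}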
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.006.patch > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch, YARN-6714.branch-2.005.patch, > YARN-6714.branch-2.006.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
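For illustration, a sketch of the idea in the last paragraph above: since tryCommit and doneApplicationAttempt both take the scheduler write lock, tryCommit can verify that the attempt behind a proposal is still the current, running attempt before applying it. Names are simplified and hypothetical, not the YARN-6714 patch.
{code}
/**
 * Sketch: drop outdated proposals from a failed or removed app attempt
 * instead of applying them against state that no longer belongs to it.
 */
public class CommitGuardSketch {
  static class AppAttempt {
    final String attemptId;
    volatile boolean stopped;
    AppAttempt(String attemptId) { this.attemptId = attemptId; }
  }

  static class App {
    volatile AppAttempt currentAttempt;
  }

  /** Returns true if the proposal from this attempt may still be applied. */
  static boolean canCommit(App app, AppAttempt proposalAttempt) {
    AppAttempt current = app.currentAttempt;
    return current != null
        && !current.stopped
        && current.attemptId.equals(proposalAttempt.attemptId);
  }
}
{code}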
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.005.patch Sure, upload new patch to resolve the conflict with YARN-6714 in TestCapacitySchedulerAsyncScheduling. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch, YARN-6678.005.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.005.patch > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch, YARN-6714.branch-2.005.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087164#comment-16087164 ] Tao Yang edited comment on YARN-6714 at 7/14/17 2:14 PM: - Sorry to have misplaced the actual types, and there are more custom generic types should be explicitly specified. Upload a new patch. was (Author: tao yang): Sorry to have misplaced the actual types. Upload a new patch. > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: (was: YARN-6714.branch-2.005.patch) > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.005.patch Sorry to have misplaced the actual types. Upload a new patch. > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch, YARN-6714.branch-2.005.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.004.patch For the check javac warning, It seems that the custom generic types of SchedulerContainer should be explicitly specified while creating a new instance in branch-2. Upload new patch to add the actual types: SchedulerContainerreservedContainer = ... > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch, > YARN-6714.branch-2.004.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
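For context, the declaration being described above roughly looks as follows; the type arguments and the variable name are an assumption based on how SchedulerContainer is parameterized in the CapacityScheduler code path, not a quote from the patch:
{code}
// Sketch only: spell out the type arguments explicitly so that branch-2 javac
// does not fall back to the raw SchedulerContainer type and emit a warning.
SchedulerContainer<FiCaSchedulerApp, FiCaSchedulerNode> reservedContainer = ...
{code}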
[jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.branch-2.003.patch Upload a patch for branch-2. Thanks [~sunilg] for review and committing. > IllegalStateException while handling APP_ATTEMPT_REMOVED event when > async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch, YARN-6714.branch-2.003.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083824#comment-16083824 ] Tao Yang commented on YARN-6678: I confirmed that it's fine. Thanks [~sunilg] for your help. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: (was: YARN-6678.004.patch) > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.004.patch > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.004.patch Thanks [~sunilg] for your time. As your mentioned, This new patch adds timeout for every where clause, adds nodeId for debug info, and calls MockRM#stop at last of new test case. TestCapacitySchedulerAsyncScheduling can be passed now. Sorry to be late for updating this patch. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch, YARN-6678.004.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
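As a rough illustration of the kind of test hardening described in the comment above (the wait helper, interval and timeout values are assumptions made for this sketch, not the actual patch):
{code}
// Bound every wait so a hung commit cannot stall the suite, and always stop
// the MockRM so later tests start from a clean state.
try {
  GenericTestUtils.waitFor(
      () -> schedulerNode.getReservedContainer() != null,
      100 /* check interval ms */, 10000 /* timeout ms */);
} finally {
  rm.stop();
}
{code}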
[jira] [Created] (YARN-6737) Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler
Tao Yang created YARN-6737: -- Summary: Rename getApplicationAttempt to getCurrentAttempt in AbstractYarnScheduler/CapacityScheduler Key: YARN-6737 URL: https://issues.apache.org/jira/browse/YARN-6737 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Tao Yang Priority: Minor As discussed in YARN-6714 (https://issues.apache.org/jira/browse/YARN-6714?focusedCommentId=16052158&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16052158), AbstractYarnScheduler#getApplicationAttempt is inconsistent with its name: it discards the application_attempt_id and always returns the latest attempt. We should: 1) Rename it to getCurrentAttempt, 2) Change the parameter from attemptId to applicationId, 3) Scan all usages to see whether any similar issue could happen. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
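A minimal sketch of what the proposed rename amounts to; the return type shown is the CapacityScheduler case and the exact signatures would be settled in this JIRA:
{code}
// Current: the name suggests an attempt-specific lookup, but the attempt id is
// effectively ignored and the latest attempt is returned.
FiCaSchedulerApp getApplicationAttempt(ApplicationAttemptId applicationAttemptId);

// Proposed: the name and parameter match what the method actually does.
FiCaSchedulerApp getCurrentAttempt(ApplicationId applicationId);
{code}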
[jira] [Updated] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.003.patch Update the patch with adding comments for sanity check of attemptId. Thanks [~sunilg] for your suggestion. > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, > YARN-6714.003.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
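To make the described sanity check concrete, a rough sketch of the idea (illustrative only, not the actual patch; it relies on CapacityScheduler#tryCommit and CapacityScheduler#doneApplicationAttempt both holding the scheduler write lock):
{code}
// In the commit path, reject proposals that belong to an attempt which is no
// longer the current attempt of the application, instead of letting them reach
// FiCaSchedulerNode#unreserveResource and crash the RM.
FiCaSchedulerApp app = getApplicationAttempt(attemptId);
if (app == null || !attemptId.equals(app.getApplicationAttemptId())) {
  // outdated proposal from a removed/failed attempt: drop it
  return;
}
{code}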
[jira] [Comment Edited] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060372#comment-16060372 ] Tao Yang edited comment on YARN-6678 at 6/23/17 4:11 AM: - Thanks [~sunilg] for your comments. {quote} 1. In FiCaSchedulerApp#accept, its better to use RMContainer#equals instead of using != {quote} As [~leftnoteasy] mentioned, it should be enough to use == to compare two instances. Are there some other concerns about this? I noticed that this patch caused several failed tests, but these are all passed when I run it locally. What might be the problem? was (Author: tao yang): Thanks [~sunilg] for your comments. {quote} 1. In FiCaSchedulerApp#accept, its better to use RMContainer#equals instead of using != {quote} As [~leftnoteasy] mentioned, it should be enough to use == to compare two instances. Are there some other concerns about this? > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). 
> // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060372#comment-16060372 ] Tao Yang commented on YARN-6678: Thanks [~sunilg] for your comments. {quote} 1. In FiCaSchedulerApp#accept, its better to use RMContainer#equals instead of using != {quote} As [~leftnoteasy] mentioned, it should be enough to use == to compare two instances. Are there some other concerns about this? > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
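For readers following the discussion, a rough sketch of the stronger check being proposed, extending the snippet quoted above (illustrative only, not the exact patch):
{code}
// For a re-reservation, the node must still be reserved by the very same
// RMContainer that this proposal refers to; comparing references with == is
// enough here because both sides are the RMContainer instance held by the
// scheduler.
RMContainer nodeReserved =
    schedulerContainer.getSchedulerNode().getReservedContainer();
if (schedulerContainer.getRmContainer().getState()
    == RMContainerState.RESERVED) {
  if (nodeReserved != schedulerContainer.getRmContainer()) {
    // the node was unreserved or re-reserved for another container meanwhile
    return false;
  }
  reReservation = true;
} else {
  // first-time reservation: the node must not already be reserved by anyone
  if (nodeReserved != null) {
    return false;
  }
}
{code}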
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.003.patch Updated the patch without adding new method to CapacityScheduler. Thanks [~leftnoteasy] for your suggestion, it's fine to only change the spy target for the test case. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch, > YARN-6678.003.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.002.patch Updated the patch with moving test case to TestCapacitySchedulerAsyncScheduling. > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang edited comment on YARN-6714 at 6/20/17 9:47 AM: - Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} I remembered why these test cases are not in TestCapacitySchedulerAsyncScheduling before, these cases are complex and hard to reproduce when async-scheduling enabled, for example, it's hard to allocate multiple containers as we need. Can I move these test cases to TestCapacitySchedulerAsyncScheduling but not enable async-scheduling ? {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D was (Author: tao yang): Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} I remembered why these test cases are not in TestCapacitySchedulerAsyncScheduling before, these cases is complex and hard to reproduce when async-scheduling enabled. Can I move these test cases to TestCapacitySchedulerAsyncScheduling but not enable async-scheduling ? {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... 
> at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang edited comment on YARN-6714 at 6/20/17 9:38 AM: - Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} I remembered why these test cases are not in TestCapacitySchedulerAsyncScheduling before, these cases is complex and hard to reproduce when async-scheduling enabled. Can I move these test cases to TestCapacitySchedulerAsyncScheduling but not enable async-scheduling ? {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D was (Author: tao yang): Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... 
> at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang edited comment on YARN-6714 at 6/20/17 9:08 AM: - Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on that :D was (Author: tao yang): Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on this :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... 
> at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055413#comment-16055413 ] Tao Yang commented on YARN-6714: Thanks [~leftnoteasy] for reviewing the patch. {quote} Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well). {quote} Sure, I will update the patch later for this and YARN-6678. {quote} could you file a separate JIRA for that? (And welcome if you can work on that ). {quote} I'm glad to work on this :D > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055310#comment-16055310 ] Tao Yang commented on YARN-6678: Thanks [~leftnoteasy] for your comments. {quote} Instead of using RmContainer().equals, it should be enough to use == to compare two instances, correct? {quote} Correct, just noticed that as you mentioned. {quote} is there any other way to avoid adding the new method to CapacityScheduler? {quote} It's necessary to add new method if spy on app attempt. I'll try to find another way to test this problem, for example, spy on CapacityScheduler instance > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.002.patch > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: (was: YARN-6678.002.patch) > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --- Attachment: YARN-6714.001.patch Attach a patch for review > RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED > event when async-scheduling enabled in CapacityScheduler > - > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6714.001.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover > and unreserve all reserved containers, it still have chance to get and commit > the outdated reserve proposal of the failed app attempt. This problem > happened on an app in our cluster, when this app stopped, it unreserved all > reserved containers and compared these appAttemptId with current > appAttemptId, if not match it will throw IllegalStateException and make RM > crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application > appattempt_1495188831758_0121_02 when currently reserved for application > application_1495188831758_0121 on node host: node1:45454 #containers=2 > available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and > CapacityScheduler#tryCommit both need to get write_lock before executing, so > we can check the app attempt state in commit process to avoid committing > outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6714) RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler
Tao Yang created YARN-6714: -- Summary: RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler Key: YARN-6714 URL: https://issues.apache.org/jira/browse/YARN-6714 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang Currently in async-scheduling mode of CapacityScheduler, after an AM failover unreserves all reserved containers, the scheduler still has a chance to fetch and commit an outdated reserve proposal of the failed app attempt. This problem happened to an app in our cluster: when the app stopped, it unreserved all reserved containers and compared their appAttemptId with the current appAttemptId; on a mismatch it threw an IllegalStateException and crashed the RM. Error log: {noformat} 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler java.lang.IllegalStateException: Trying to unreserve for application appattempt_1495188831758_0121_02 when currently reserved for application application_1495188831758_0121 on node host: node1:45454 #containers=2 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) at java.lang.Thread.run(Thread.java:834) {noformat} When async-scheduling is enabled, CapacityScheduler#doneApplicationAttempt and CapacityScheduler#tryCommit both need to acquire the write lock before executing, so we can check the app attempt state in the commit process to avoid committing outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.002.patch > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch, YARN-6678.002.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 > 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container. > We should confirm that reserved container on this node is equal to re-reserve > container. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Description: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of reserve proposal will be checked only for first-reserve container. We should confirm that reserved container on this node is equal to re-reserve container. was: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. 
nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) {
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Description: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of reserve proposal will be checked only for first-reserve container, not for the re-reserve container. We could check reserved container on this node with re-reserve container to avoid this problem. was: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. 
nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) {
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Description: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of reserve proposal will be checked only for first-reserve container, not for the re-reserve container. I think FiCaSchedulerApp#accept should do this check for all reserve proposal not matter if the container is re-reserve or not. was: Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... 
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserved proposal-1 2. nm2 has enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but throw IllegalStateException when applying Currently the check code for reserve proposal in FiCaSchedulerApp#accept as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer()
[jira] [Assigned] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang reassigned YARN-6678: -- Assignee: Tao Yang > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6678.001.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserved proposal-1 > 2. nm2 has enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container, not for the re-reserve container. > I think FiCaSchedulerApp#accept should do this check for all reserve proposal > not matter if the container is re-reserve or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6678: --- Attachment: YARN-6678.001.patch Attach a patch with UT for review. > Committer thread crashes with IllegalStateException in async-scheduling mode > of CapacityScheduler > - > > Key: YARN-6678 > URL: https://issues.apache.org/jira/browse/YARN-6678 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Tao Yang > Attachments: YARN-6678.001.patch > > > Error log: > {noformat} > java.lang.IllegalStateException: Trying to reserve container > container_e10_1495599791406_7129_01_001453 for application > appattempt_1495599791406_7129_01 when currently reserved container > container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 > #containers=40 available=... used=... > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) > {noformat} > Reproduce this problem: > 1. nm1 re-reserved app-1/container-X1 and generated reserved proposal-1 > 2. nm2 has enough resource for app-1, un-reserved app-1/container-X1 and > allocated app-1/container-X2 > 3. nm1 reserved app-2/container-Y > 4. proposal-1 was accepted but throw IllegalStateException when applying > Currently the check code for reserve proposal in FiCaSchedulerApp#accept as > follows: > {code} > // Container reserved first time will be NEW, after the container > // accepted & confirmed, it will become RESERVED state > if (schedulerContainer.getRmContainer().getState() > == RMContainerState.RESERVED) { > // Set reReservation == true > reReservation = true; > } else { > // When reserve a resource (state == NEW is for new container, > // state == RUNNING is for increase container). > // Just check if the node is not already reserved by someone > if (schedulerContainer.getSchedulerNode().getReservedContainer() > != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Try to reserve a container, but the node is " > + "already reserved by another container=" > + schedulerContainer.getSchedulerNode() > .getReservedContainer().getContainerId()); > } > return false; > } > } > {code} > The reserved container on the node of reserve proposal will be checked only > for first-reserve container, not for the re-reserve container. > I think FiCaSchedulerApp#accept should do this check for all reserve proposal > not matter if the container is re-reserve or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-6678) Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler
Tao Yang created YARN-6678: -- Summary: Committer thread crashes with IllegalStateException in async-scheduling mode of CapacityScheduler Key: YARN-6678 URL: https://issues.apache.org/jira/browse/YARN-6678 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Tao Yang Error log: {noformat} java.lang.IllegalStateException: Trying to reserve container container_e10_1495599791406_7129_01_001453 for application appattempt_1495599791406_7129_01 when currently reserved container container_e10_1495599791406_7123_01_001513 on node host: node0123:45454 #containers=40 available=... used=... at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.reserveResource(FiCaSchedulerNode.java:81) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.reserve(FiCaSchedulerApp.java:1079) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:795) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2770) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:546) {noformat} Reproduce this problem: 1. nm1 re-reserved app-1/container-X1 and generated reserve proposal-1 2. nm2 had enough resource for app-1, un-reserved app-1/container-X1 and allocated app-1/container-X2 3. nm1 reserved app-2/container-Y 4. proposal-1 was accepted but threw an IllegalStateException when applying Currently the check code for a reserve proposal in FiCaSchedulerApp#accept is as follows: {code} // Container reserved first time will be NEW, after the container // accepted & confirmed, it will become RESERVED state if (schedulerContainer.getRmContainer().getState() == RMContainerState.RESERVED) { // Set reReservation == true reReservation = true; } else { // When reserve a resource (state == NEW is for new container, // state == RUNNING is for increase container). // Just check if the node is not already reserved by someone if (schedulerContainer.getSchedulerNode().getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Try to reserve a container, but the node is " + "already reserved by another container=" + schedulerContainer.getSchedulerNode() .getReservedContainer().getContainerId()); } return false; } } {code} The reserved container on the node of a reserve proposal is checked only for a first-time reservation, not for a re-reservation. I think FiCaSchedulerApp#accept should do this check for every reserve proposal, no matter whether the container is a re-reservation or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
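One way to read the last paragraph of that description, together with the review comment about using == for comparison, is that the re-reservation branch should also verify that the container already reserved on the node is the same RMContainer instance the proposal wants to re-reserve. A rough sketch of such a check, reusing the variable names from the snippet quoted above; this is an illustration of the idea, not the committed patch:
{code}
// Sketch of an extended check in FiCaSchedulerApp#accept (assumption, not
// the actual YARN-6678 patch).
RMContainer reservedOnNode =
    schedulerContainer.getSchedulerNode().getReservedContainer();

if (schedulerContainer.getRmContainer().getState()
    == RMContainerState.RESERVED) {
  // Re-reservation proposal: the node must still hold the very same
  // reserved container instance; if another container was reserved on the
  // node in the meantime, this proposal is outdated and must be rejected.
  if (reservedOnNode != schedulerContainer.getRmContainer()) {
    return false;
  }
  reReservation = true;
} else {
  // First-time reservation: the node must not be reserved by anyone else.
  if (reservedOnNode != null) {
    return false;
  }
}
{code}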
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Description: I wrote a test case to reproduce another problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); was: I wrote a test case to reproduce another problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at
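The NPE above comes from dereferencing the result of schedulerKeyToPlacementSets.get(schedulerRequestKey) after the AM has already removed that request. One possible defensive guard in AppSchedulingInfo#allocate is sketched below; this is only an illustration of the idea (the actual fix may instead reject the outdated proposal earlier, e.g. in the accept phase), and the logging and control flow are assumptions:
{code}
// Hypothetical guard in AppSchedulingInfo#allocate (sketch only, not the
// actual YARN-6629 patch). The placement set for this schedulerRequestKey
// may already have been removed if the AM cancelled the request before the
// committer thread applied the allocation proposal.
if (schedulerKeyToPlacementSets.get(schedulerRequestKey) == null) {
  // Request no longer exists: treat the proposal as outdated and skip the
  // per-request bookkeeping instead of throwing a NullPointerException.
  LOG.warn("Skipping allocation for removed request key "
      + schedulerRequestKey);
} else {
  schedulerKeyToPlacementSets.get(schedulerRequestKey)
      .allocate(schedulerKey, type, node);
}
{code}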
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Description: I wrote a test case to reproduce another problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal and wanted to deduct the pending resource CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); was: I wrote a test case to test other problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Description: I wrote a test case to test other problem for branch-2 and found new NPE error, log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal and wanted to deduct the pending resource CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); was: Error log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at
[jira] [Updated] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
[ https://issues.apache.org/jira/browse/YARN-6629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6629: --- Attachment: YARN-6629.001.patch Attach a patch for review. > NPE occurred when container allocation proposal is applied but its resource > requests are removed before > --- > > Key: YARN-6629 > URL: https://issues.apache.org/jira/browse/YARN-6629 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0-alpha2 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6629.001.patch > > > Error log: > {code} > FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in > handling event type NODE_UPDATE to the Event Dispatcher > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) > at > org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) > at > org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) > at org.mockito.internal.MockHandler.handle(MockHandler.java:97) > at > org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) > at > org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) > at java.lang.Thread.run(Thread.java:745) > {code} > Reproduce this error in chronological order: > 1. AM started and requested 1 container with schedulerRequestKey#1 : > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests > Added schedulerRequestKey#1 into schedulerKeyToPlacementSets > 2. Scheduler allocatd 1 container for this request and accepted the proposal > 3. 
AM removed this request > ApplicationMasterService#allocate --> CapacityScheduler#allocate --> > SchedulerApplicationAttempt#updateResourceRequests --> > AppSchedulingInfo#updateResourceRequests --> > AppSchedulingInfo#addToPlacementSets --> > AppSchedulingInfo#updatePendingResources > Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets > 4. Scheduler applied this proposal and wanted to deduct the pending resource > CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> > AppSchedulingInfo#allocate > An NPE is thrown when calling > schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, > type, node); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
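To make the race above concrete, here is a minimal sketch of the missing null guard, assuming simplified stand-in types rather than the real AppSchedulingInfo/SchedulingPlacementSet classes (this is an illustration of the failure mode, not the YARN-6629 patch):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the AppSchedulingInfo#allocate path from the stack
// trace above: the scheduler key may be removed by the AM between the
// proposal being accepted and being applied.
class AppSchedulingInfoSketch {
  interface PlacementSet { void allocate(); }

  // stands in for schedulerKeyToPlacementSets
  private final Map<String, PlacementSet> schedulerKeyToPlacementSets =
      new ConcurrentHashMap<>();

  boolean allocate(String schedulerRequestKey) {
    PlacementSet ps = schedulerKeyToPlacementSets.get(schedulerRequestKey);
    if (ps == null) {
      // The AM cancelled this request after the proposal was accepted but
      // before it was applied, so the key is gone; skip the pending-resource
      // deduction instead of dereferencing null (the unguarded call is what
      // killed the NODE_UPDATE event dispatcher above).
      return false;
    }
    ps.allocate(); // stands in for allocate(schedulerKey, type, node)
    return true;
  }
}
{code}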
[jira] [Created] (YARN-6629) NPE occurred when container allocation proposal is applied but its resource requests are removed before
Tao Yang created YARN-6629: -- Summary: NPE occurred when container allocation proposal is applied but its resource requests are removed before Key: YARN-6629 URL: https://issues.apache.org/jira/browse/YARN-6629 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha2, 2.9.0 Reporter: Tao Yang Assignee: Tao Yang Error log: {code} FATAL event.EventDispatcher (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to the Event Dispatcher java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:446) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:516) at org.apache.hadoop.yarn.client.TestNegativePendingResource$1.answer(TestNegativePendingResource.java:225) at org.mockito.internal.stubbing.StubbedInvocationMatcher.answer(StubbedInvocationMatcher.java:31) at org.mockito.internal.MockHandler.handle(MockHandler.java:97) at org.mockito.internal.creation.MethodInterceptorFilter.intercept(MethodInterceptorFilter.java:47) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp$$EnhancerByMockitoWithCGLIB$$29eb8afc.apply() at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.submitResourceCommitRequest(CapacityScheduler.java:2281) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1247) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1236) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1325) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1112) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:987) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1367) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:143) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:745) {code} Reproduce this error in chronological order: 1. AM started and requested 1 container with schedulerRequestKey#1 : ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests Added schedulerRequestKey#1 into schedulerKeyToPlacementSets 2. Scheduler allocatd 1 container for this request and accepted the proposal 3. AM removed this request ApplicationMasterService#allocate --> CapacityScheduler#allocate --> SchedulerApplicationAttempt#updateResourceRequests --> AppSchedulingInfo#updateResourceRequests --> AppSchedulingInfo#addToPlacementSets --> AppSchedulingInfo#updatePendingResources Removed schedulerRequestKey#1 from schedulerKeyToPlacementSets) 4. 
Scheduler applied this proposal and wanted to deduct the pending resource CapacityScheduler#tryCommit --> FiCaSchedulerApp#apply --> AppSchedulingInfo#allocate Throw NPE when called schedulerKeyToPlacementSets.get(schedulerRequestKey).allocate(schedulerKey, type, node); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958249#comment-15958249 ] Tao Yang commented on YARN-6403: [~jlowe], thanks for review and committing! > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3 > > Attachments: YARN-6403.001.patch, YARN-6403.002.patch, > YARN-6403.004.patch, YARN-6403.branch-2.8.003.patch, > YARN-6403.branch-2.8.004.patch, YARN-6403.branch-2.8.004.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
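As an illustration of the server-side half of that suggestion, a minimal sketch with stand-in types (not the actual ContainerImpl or LocalResourceRequest code) of validating the resource location up front, so only the offending container fails instead of the dispatcher thread:
{code}
import java.net.URISyntaxException;

// Stand-in types only (not the real YARN classes): check the resource URL
// before constructing the request whose constructor NPEs in the code above.
class LocalResourceGuardSketch {
  interface UrlStub { String toPath() throws URISyntaxException; }
  interface LocalResourceStub { UrlStub getResource(); }

  /** Returns an error message for an invalid resource, or null if it is usable. */
  static String validate(String name, LocalResourceStub resource) {
    if (resource == null || resource.getResource() == null) {
      // The caller can fail just this container (e.g. move it to a
      // localization-failed state) instead of letting an NPE escape to
      // the AsyncDispatcher and take down the NM.
      return "Null resource URL for local resource " + name;
    }
    return null;
  }
}
{code}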
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.004.patch YARN-6403.branch-2.8.004.patch Thanks [~jlowe] for your suggestions. Client-side test is moved to TestApplicationClientProtocolRecords now and TestContainerManagerWithLCE is updated to avoid failure. Attach new patches for branch-2.8 and trunk. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch, > YARN-6403.004.patch, YARN-6403.branch-2.8.003.patch, > YARN-6403.branch-2.8.004.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.branch-2.8.003.patch Attach new patch for branch-2.8 > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch, > YARN-6403.branch-2.8.003.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. 
Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950419#comment-15950419 ] Tao Yang commented on YARN-6403: [~jlowe] Thanks for your time! {quote} I believe it's appropriate to throw NPE in our client check code as well rather than a generic RuntimeException. It's a minor point since the net effect will be similar for the client in either case. {quote} Makes sense, sorry for missing the point before. {quote} TestApplicationClientProtocolRecords looks like a decent place since it's already has another test for ContainerLaunchContextPBImpl there. {quote} TestApplicationClientProtocolRecords does not exist in branch-2.8, so is it ok to place the UT for client-side in TestPBImplRecords#testContainerLaunchContextPBImpl? In addition, the error message and unit test code will be improved in the next patch. One patch can't fit all branches, perhaps it's necessary to submit patches for 2.9(branch-2) and 3.0.0-alpha3(trunk)? > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. 
> {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail:
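For reference, a self-contained sketch of the kind of client-side unit test being discussed; the test class and the inlined validation below are hypothetical stand-ins, not the actual TestApplicationClientProtocolRecords or TestPBImplRecords code:
{code}
import static org.junit.Assert.fail;

import java.util.HashMap;
import java.util.Map;
import org.junit.Test;

// Hypothetical test sketch: asserts that a local resource with a null
// location is rejected on the client instead of surfacing as an NM crash.
public class TestNullLocalResourceSketch {

  // Stand-in for the validation the patch adds to the client-side
  // setLocalResources path.
  private static void setLocalResources(Map<String, Object> localResources) {
    for (Map.Entry<String, Object> e : localResources.entrySet()) {
      if (e.getValue() == null) {
        throw new NullPointerException(
            "Null resource URL for local resource " + e.getKey());
      }
    }
  }

  @Test
  public void testNullResourceRejected() {
    Map<String, Object> resources = new HashMap<>();
    resources.put("test", null); // resource whose location is missing
    try {
      setLocalResources(resources);
      fail("Expected NPE for a local resource with no location");
    } catch (NullPointerException expected) {
      // expected: fail fast in the client rather than on the NodeManager
    }
  }
}
{code}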
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.002.patch [~jlowe] Thanks for correcting me. The last server-side change is not proper and I corrected it as your mentioned. For the client-side change, IIUIC the generated protobuf code won't throws NPE for this case actually. Unit tests for both the client and server change is added. Attach a new patch for review, please correct me if I missed something. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Attachment: YARN-6403.001.patch Attach a patch for review. * Add local resources check in ContainerImpl$RequestResourcesTransition to avoid NM failing, the container with invalid resource will fail to launch in this step. * Add local resources check in ContainerLaunchContextPBImpl#setLocalResources to fail the app with invalid resource early in client, as it's a waste for cluster to launch a bound-to-fail app. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang > Attachments: YARN-6403.001.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. 
> {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946419#comment-15946419 ] Tao Yang commented on YARN-6403: [~Naganarasimha] Yes, I would like to work on this and will submit a patch for review soon. > Invalid local resource request can raise NPE and make NM exit > - > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.0 >Reporter: Tao Yang > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. 
Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6403: --- Description: Recently we found this problem on our testing environment. The app that caused this problem added a invalid local resource request(have no location) into ContainerLaunchContext like this: {code} localResources.put("test", LocalResource.newInstance(location, LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, System.currentTimeMillis())); ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(localResources, environment, vargsFinal, null, securityTokens, acls); {code} The actual value of location was null although app doesn't expect that. This mistake cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were deleted. {code} FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:745) {code} NPE occured when created LocalResourceRequest instance for invalid resource request. {code} public LocalResourceRequest(LocalResource resource) throws URISyntaxException { this(resource.getResource().toPath(), //NPE occurred here resource.getTimestamp(), resource.getType(), resource.getVisibility(), resource.getPattern()); } {code} We can't guarantee the validity of local resource request now, but we could avoid damaging the cluster. Perhaps we can verify the resource both in ContainerLaunchContext and LocalResourceRequest? Please feel free to give your suggestions. was: Recently we found this problem on our testing environment. 
The app that caused this problem added a invalid local resource request(have no location) into ContainerLaunchContext like this: {code} localResources.put("test", LocalResource.newInstance(location, LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, System.currentTimeMillis())); ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(localResources, environment, vargsFinal, null, securityTokens, acls); {code} The actual value of location was null although app doesn't expect that. This mistake cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were deleted. {code} java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at
[jira] [Created] (YARN-6403) Invalid local resource request can raise NPE and make NM exit
Tao Yang created YARN-6403: -- Summary: Invalid local resource request can raise NPE and make NM exit Key: YARN-6403 URL: https://issues.apache.org/jira/browse/YARN-6403 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.8.0 Reporter: Tao Yang Recently we found this problem on our testing environment. The app that caused this problem added a invalid local resource request(have no location) into ContainerLaunchContext like this: {code} localResources.put("test", LocalResource.newInstance(location, LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, System.currentTimeMillis())); ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(localResources, environment, vargsFinal, null, securityTokens, acls); {code} The actual value of location was null although app doesn't expect that. This mistake cause several NMs exited with the NPE below and can't restart until the nm recovery dirs were deleted. {code} java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.(LocalResourceRequest.java:46) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) at java.lang.Thread.run(Thread.java:745) {code} NPE occured when created LocalResourceRequest instance for invalid resource request. {code} public LocalResourceRequest(LocalResource resource) throws URISyntaxException { this(resource.getResource().toPath(), //NPE occurred here resource.getTimestamp(), resource.getType(), resource.getVisibility(), resource.getPattern()); } {code} We can't guarantee the validity of local resource request now, but we could avoid damaging the cluster. Perhaps we can verify the resource both in ContainerLaunchContext and LocalResourceRequest? Please feel free to give your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
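For completeness, a hedged sketch of the application-side fix implied by the description above: derive the URL passed to LocalResource.newInstance from a real file status so it can never be null. This is the conventional client pattern, not code from this JIRA, and the helper names may differ between Hadoop versions:
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class LocalResourceClientSketch {
  // Builds a LocalResource whose location comes from an existing file,
  // so the URL handed to the launch context is never null.
  static LocalResource toLocalResource(FileSystem fs, Path file)
      throws IOException {
    FileStatus status = fs.getFileStatus(file); // throws if the file is missing
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(status.getPath()),
        LocalResourceType.FILE,
        LocalResourceVisibility.PRIVATE,
        status.getLen(),
        status.getModificationTime());
  }
}
{code}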
[jira] [Commented] (YARN-6259) Support pagination and optimize data transfer with zero-copy approach for containerlogs REST API in NMWebServices
[ https://issues.apache.org/jira/browse/YARN-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891423#comment-15891423 ] Tao Yang commented on YARN-6259: Hi, [~rohithsharma]. Thank you for looking into this issue. {quote} I am not sure about how use cases will be served {quote} One common use case is to request the last part of a log and easily skip to other parts when investigating a problem; compared with loading the entire log, this can save a lot of time. We have an external system that tracks apps and shows container logs; most of these logs are very large, so a pagination function is needed, and the newly added containerlogs-info REST API is part of it. {quote} Instead of adding new LogInfo file, there is ContainerLogInfo file which can be used for pageSize and pageIndex. {quote} ContainerLogInfo does not seem to exist in branch-2.8; perhaps it's for a higher version? > Support pagination and optimize data transfer with zero-copy approach for > containerlogs REST API in NMWebServices > - > > Key: YARN-6259 > URL: https://issues.apache.org/jira/browse/YARN-6259 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.1 >Reporter: Tao Yang >Assignee: Tao Yang > Attachments: YARN-6259.001.patch > > > Currently the containerlogs REST API in NMWebServices reads and sends the > entire content of container logs. Most container logs are large and it's > useful to support pagination. > * Add pagesize and pageindex parameters for the containerlogs REST API > {code} > URL: http:///ws/v1/node/containerlogs// > QueryParams: > pagesize - max bytes of one page, default 1MB > pageindex - index of the required page, default 0, can be negative (set -1 will > get the last page content) > {code} > * Add a containerlogs-info REST API since sometimes we need to know the > totalSize/pageSize/pageCount info of the log > {code} > URL: > http:///ws/v1/node/containerlogs-info// > QueryParams: > pagesize - max bytes of one page, default 1MB > Response example: > {"logInfo":{"totalSize":2497280,"pageSize":1048576,"pageCount":3}} > {code} > Moreover, the data transfer pipeline (disk --> read buffer --> NM buffer --> > socket buffer) can be optimized to the pipeline (disk --> read buffer --> socket > buffer) with a zero-copy approach. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
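To make the pagination and zero-copy ideas concrete, a simplified sketch using plain Java NIO (this is an illustration, not the NMWebServices patch; parameter validation is omitted) that serves one page of a log file with FileChannel#transferTo:
{code}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ContainerLogPagerSketch {
  // Writes one page of the log to the response channel without copying the
  // bytes through a user-space buffer on the NM.
  static void writePage(Path logFile, long pageSize, long pageIndex,
      WritableByteChannel out) throws IOException {
    try (FileChannel in = FileChannel.open(logFile, StandardOpenOption.READ)) {
      long totalSize = in.size();
      long pageCount = (totalSize + pageSize - 1) / pageSize;
      if (pageIndex < 0) {
        pageIndex = pageCount + pageIndex; // e.g. -1 selects the last page
      }
      long position = pageIndex * pageSize;
      long remaining = Math.min(pageSize, totalSize - position);
      // transferTo lets the kernel move file bytes to the socket directly
      // (disk --> read buffer --> socket buffer), which is the zero-copy
      // pipeline described in the issue.
      while (remaining > 0) {
        long sent = in.transferTo(position, remaining, out);
        position += sent;
        remaining -= sent;
      }
    }
  }
}
{code}
With pagesize=1048576 and pageindex=-1, this would stream only the last megabyte of the log, matching the totalSize/pageSize/pageCount response example shown in the containerlogs-info description above.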