[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart
[ https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298747#comment-16298747 ] Eric Yang edited comment on YARN-7565 at 12/20/17 5:33 PM: --- Thank you for point out the ServiceRecord.description maps to container name (and not Service Spec description field), but it appears to be a race condition for newly created application. serviceStart is invoked recoverComponent first. Application hasn't registered with Registry yet. This looks like the reason that we get null pointer exception. was (Author: eyang): Thank you for point out the record.description maps to container name, but it appears to be a race condition for newly created application. serviceStart is invoked recoverComponent first. Application hasn't registered with Registry yet. This looks like the reason that we get null pointer exception. > Yarn service pre-maturely releases the container after AM restart > -- > > Key: YARN-7565 > URL: https://issues.apache.org/jira/browse/YARN-7565 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh > Fix For: 3.1.0 > > Attachments: YARN-7565.001.patch, YARN-7565.002.patch, > YARN-7565.003.patch, YARN-7565.004.patch, YARN-7565.005.patch, > YARN-7565.addendum.001.patch > > > With YARN-6168, recovered containers can be reported to AM in response to the > AM heartbeat. > Currently, the Service Master will release the containers, that are not > reported in the AM registration response, immediately. > Instead, the master can wait for a configured amount of time for the > containers to be recovered by RM. These containers are sent to AM in the > heartbeat response. Once a container is not reported in the configured > interval, it can be released by the master. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart
[ https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286746#comment-16286746 ] Chandni Singh edited comment on YARN-7565 at 12/11/17 11:33 PM: bq. yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout ? How about yarn.service.container-recovery.timeout? "am-recovery.timeout" gives an impression that there is a timeout for AM recovery. was (Author: csingh): > yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout > ? How about yarn.service.container-recovery.timeout? "am-recovery.timeout" gives an impression that there is a timeout for AM recovery. > Yarn service pre-maturely releases the container after AM restart > -- > > Key: YARN-7565 > URL: https://issues.apache.org/jira/browse/YARN-7565 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh > Fix For: yarn-native-services > > Attachments: YARN-7565.001.patch, YARN-7565.002.patch, > YARN-7565.003.patch > > > With YARN-6168, recovered containers can be reported to AM in response to the > AM heartbeat. > Currently, the Service Master will release the containers, that are not > reported in the AM registration response, immediately. > Instead, the master can wait for a configured amount of time for the > containers to be recovered by RM. These containers are sent to AM in the > heartbeat response. Once a container is not reported in the configured > interval, it can be released by the master. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart
[ https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285539#comment-16285539 ] Jian He edited comment on YARN-7565 at 12/11/17 6:00 AM: - - yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout ? - better move ServiceScheduler.YARN_COMPONENT to YarnRegistryAttributes as well - this logic seems a bit confusing. It sends a recover event, but adds to unRecoveredInstances which looks contradictoy by naming. All that needed is to call "component.pendingInstances.remove(instance);". How about just expose this method as an API ? rather than sending the event ? {code} if (componentName != null) { Component component = componentsByName.get(componentName); ComponentInstance compInstance = component.getComponentInstance( record.description); ContainerId containerId = ContainerId.fromString(record.get( YarnRegistryAttributes.YARN_ID)); ComponentEvent event = new ComponentEvent(componentName, CONTAINER_RECOVERED) .setInstance(compInstance) .setContainerId(containerId); unRecoveredInstances.put(containerId, compInstance); component.handle(event); } {code} - Similary, ContainerRecoveryFailedTransition can be removed, just call the existing reInsertPendingInstance API instead (rename it to insertPendingInstance) - add componentInstanceId to the log like below {code} LOG.info("{}, Wait on container {} expired", entry.getValue().getCompInstanceId, entry.getKey()); {code} - Looks like it only adds to the pending instance list when recover timeouts, do we need to re-request the container ? {code} if (unRecoveredInstances.size() > 0) { executorService.schedule(() -> { // after containerWaitMs, all the containers that haven't be recovered // by RM will released. The corresponding Component Instances will be // adding to the pending queues of their respective component. Iterator> iterator = unRecoveredInstances.entrySet().iterator(); while (iterator.hasNext()) { Map.Entry entry = iterator.next(); iterator.remove(); LOG.info("{} Wait on container {} expired", entry.getValue().getCompInstanceId(), entry.getKey()); String componentName = entry.getValue().getCompName(); ComponentEvent event = new ComponentEvent( componentName, CONTAINER_RECOVERY_FAILED) .setInstance(entry.getValue()) .setContainerId(entry.getKey()); componentsByName.get(componentName).handle(event); amRMClient.releaseAssignedContainer(entry.getKey()); } }, containerWaitMs, TimeUnit.MILLISECONDS); {code} - Clear the unRecoveredInstances in the end of this executor thread. it'll remain in memory forever otherwise. - I think there may be a race condition where additional pending instance got added. 1. In the executor thread loop, get the unRecovered instance object (say, instance1) 2. In onContainersReceivedFromPreviousAttempts: remove unrecovered instance1 from the map 3. In onContainersReceivedFromPreviousAttempts: send recover event 4. In Component: to handle the recover event, remove from pending instance list 5. But then Executor thread add (instance1) back to the pending list We probably need to synchronize on the entire unRecoveredInstance map ? was (Author: jianhe): - yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout ? - better move ServiceScheduler.YARN_COMPONENT to YarnRegistryAttributes as well - this logic seems a bit confusing. It sends a recover event, but adds to unRecoveredInstances which looks contradictoy by naming. All that needed is to call "component.pendingInstances.remove(instance);". How about just expose this method as an API ? rather than sending the event ? {code} if (componentName != null) { Component component = componentsByName.get(componentName); ComponentInstance compInstance = component.getComponentInstance( record.description); ContainerId containerId = ContainerId.fromString(record.get( YarnRegistryAttributes.YARN_ID)); ComponentEvent event = new ComponentEvent(componentName, CONTAINER_RECOVERED) .setInstance(compInstance) .setContainerId(containerId); unRecoveredInstances.put(containerId, compInstance); component.handle(event); } {code} - Similary, ContainerRecoveryFailedTransition can be removed, just call the above addPendingInstance API instead - add componentInstanceId to the log like below {code} LOG.info("{}, Wait on container {} expired", entry.getValue().getCompInstanceId, entry.getKey()); {code} - Looks like it only adds to the pending instance list when re
[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart
[ https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279381#comment-16279381 ] Chandni Singh edited comment on YARN-7565 at 12/6/17 12:26 AM: --- Patch 2 - Addressed [~jianhe]'s comments - Instead of checking repeatedly if containers are recovered, made the change in ServiceScheduler to check just once if the unRecovered containers are still there. If they haven't been recovered by then, then the container is released and component instance is added to pending queue. - Added another test was (Author: csingh): Patch 2 - Addressed [~jianhe]'s comments - Instead of checking repeatedly if containers are recovered, made the change in ServiceScheduler to check just once if the unRecovered containers are available. - Added another test > Yarn service pre-maturely releases the container after AM restart > -- > > Key: YARN-7565 > URL: https://issues.apache.org/jira/browse/YARN-7565 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh > Fix For: yarn-native-services > > Attachments: YARN-7565.001.patch, YARN-7565.002.patch > > > With YARN-6168, recovered containers can be reported to AM in response to the > AM heartbeat. > Currently, the Service Master will release the containers, that are not > reported in the AM registration response, immediately. > Instead, the master can wait for a configured amount of time for the > containers to be recovered by RM. These containers are sent to AM in the > heartbeat response. Once a container is not reported in the configured > interval, it can be released by the master. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org