[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285539#comment-16285539
 ] 

Jian He commented on YARN-7565:
-------------------------------

- yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout 
?
- better move ServiceScheduler.YARN_COMPONENT to YarnRegistryAttributes as well
- this logic seems a bit confusing. It sends a recover event, but adds to 
unRecoveredInstances which looks contradictoy by naming. 
All that needed is to call "component.pendingInstances.remove(instance);". How 
about just expose this method as an API ? rather than sending the event ?
{code}
      if (componentName != null) {
        Component component = componentsByName.get(componentName);
        ComponentInstance compInstance = component.getComponentInstance(
            record.description);
        ContainerId containerId = ContainerId.fromString(record.get(
            YarnRegistryAttributes.YARN_ID));
        ComponentEvent event = new ComponentEvent(componentName,
            CONTAINER_RECOVERED)
            .setInstance(compInstance)
            .setContainerId(containerId);
        unRecoveredInstances.put(containerId, compInstance);
        component.handle(event);
      }
{code}
- Similary, ContainerRecoveryFailedTransition can be removed, just call the 
above addPendingInstance API instead
- add componentInstanceId to the log like below 
{code}
          LOG.info("{}, Wait on container {} expired", 
entry.getValue().getCompInstanceId, entry.getKey());
{code}
- Looks like it only adds to the pending instance list when recover timeouts, 
do we need to re-request the container ? 
{code}
    if (unRecoveredInstances.size() > 0) {
      executorService.schedule(() -> {
        // after containerWaitMs, all the containers that haven't be recovered
        // by RM will released. The corresponding Component Instances will be
        // adding to the pending queues of their respective component.
        Iterator<Map.Entry<ContainerId, ComponentInstance>> iterator =
            unRecoveredInstances.entrySet().iterator();
        while (iterator.hasNext()) {
          Map.Entry<ContainerId, ComponentInstance> entry = iterator.next();
          iterator.remove();
          LOG.info("{} Wait on container {} expired", 
entry.getValue().getCompInstanceId(), entry.getKey());
          String componentName = entry.getValue().getCompName();
          ComponentEvent event = new ComponentEvent(
              componentName, CONTAINER_RECOVERY_FAILED)
              .setInstance(entry.getValue())
              .setContainerId(entry.getKey());
          componentsByName.get(componentName).handle(event);
          amRMClient.releaseAssignedContainer(entry.getKey());
        }
      }, containerWaitMs, TimeUnit.MILLISECONDS);
{code} 
- Clear the unRecoveredInstances in the end of this executor thread. it'll 
remain in memory forever otherwise.
- I think there may be a race condition where additional pending instance got 
added. 
1. In the executor thread loop, get the unRecovered instance object (say, 
instance1)
2. In onContainersReceivedFromPreviousAttempts: remove unrecovered instance1 
from the map
3. In onContainersReceivedFromPreviousAttempts: send recover event
4. In Component: to handle the recover event, remove from pending instance list
5. But then Executor thread add (instance1) back to the pending list 
We probably need to synchronize on the entire unRecoveredInstance map ? 


> Yarn service pre-maturely releases the container after AM restart 
> ------------------------------------------------------------------
>
>                 Key: YARN-7565
>                 URL: https://issues.apache.org/jira/browse/YARN-7565
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>             Fix For: yarn-native-services
>
>         Attachments: YARN-7565.001.patch, YARN-7565.002.patch, 
> YARN-7565.003.patch
>
>
> With YARN-6168, recovered containers can be reported to AM in response to the 
> AM heartbeat. 
> Currently, the Service Master will release the containers, that are not 
> reported in the AM registration response, immediately.
> Instead, the master can wait for a configured amount of time for the 
> containers to be recovered by RM. These containers are sent to AM in the 
> heartbeat response. Once a container is not reported in the configured 
> interval, it can be released by the master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to