[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart

2017-12-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298747#comment-16298747
 ] 

Eric Yang edited comment on YARN-7565 at 12/20/17 5:33 PM:
---

Thank you for point out the ServiceRecord.description maps to container name 
(and not Service Spec description field), but it appears to be a race condition 
for newly created application.  serviceStart is invoked recoverComponent first. 
 Application hasn't registered with Registry yet.  This looks like the reason 
that we get null pointer exception.


was (Author: eyang):
Thank you for point out the record.description maps to container name, but it 
appears to be a race condition for newly created application.  serviceStart is 
invoked recoverComponent first.  Application hasn't registered with Registry 
yet.  This looks like the reason that we get null pointer exception.

> Yarn service pre-maturely releases the container after AM restart 
> --
>
> Key: YARN-7565
> URL: https://issues.apache.org/jira/browse/YARN-7565
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
> Fix For: 3.1.0
>
> Attachments: YARN-7565.001.patch, YARN-7565.002.patch, 
> YARN-7565.003.patch, YARN-7565.004.patch, YARN-7565.005.patch, 
> YARN-7565.addendum.001.patch
>
>
> With YARN-6168, recovered containers can be reported to AM in response to the 
> AM heartbeat. 
> Currently, the Service Master will release the containers, that are not 
> reported in the AM registration response, immediately.
> Instead, the master can wait for a configured amount of time for the 
> containers to be recovered by RM. These containers are sent to AM in the 
> heartbeat response. Once a container is not reported in the configured 
> interval, it can be released by the master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart

2017-12-11 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286746#comment-16286746
 ] 

Chandni Singh edited comment on YARN-7565 at 12/11/17 11:33 PM:


bq. yarn.service.container.expiry-interval-ms -> 
yarn.service.am-recovery.timeout ?
How about yarn.service.container-recovery.timeout? "am-recovery.timeout" gives 
an impression that there is a timeout for AM recovery.


was (Author: csingh):
> yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout 
> ?
How about yarn.service.container-recovery.timeout? "am-recovery.timeout" gives 
an impression that there is a timeout for AM recovery.

> Yarn service pre-maturely releases the container after AM restart 
> --
>
> Key: YARN-7565
> URL: https://issues.apache.org/jira/browse/YARN-7565
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
> Fix For: yarn-native-services
>
> Attachments: YARN-7565.001.patch, YARN-7565.002.patch, 
> YARN-7565.003.patch
>
>
> With YARN-6168, recovered containers can be reported to AM in response to the 
> AM heartbeat. 
> Currently, the Service Master will release the containers, that are not 
> reported in the AM registration response, immediately.
> Instead, the master can wait for a configured amount of time for the 
> containers to be recovered by RM. These containers are sent to AM in the 
> heartbeat response. Once a container is not reported in the configured 
> interval, it can be released by the master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart

2017-12-10 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285539#comment-16285539
 ] 

Jian He edited comment on YARN-7565 at 12/11/17 6:00 AM:
-

- yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout 
?
- better move ServiceScheduler.YARN_COMPONENT to YarnRegistryAttributes as well
- this logic seems a bit confusing. It sends a recover event, but adds to 
unRecoveredInstances which looks contradictoy by naming. 
All that needed is to call "component.pendingInstances.remove(instance);". How 
about just expose this method as an API ? rather than sending the event ?
{code}
  if (componentName != null) {
Component component = componentsByName.get(componentName);
ComponentInstance compInstance = component.getComponentInstance(
record.description);
ContainerId containerId = ContainerId.fromString(record.get(
YarnRegistryAttributes.YARN_ID));
ComponentEvent event = new ComponentEvent(componentName,
CONTAINER_RECOVERED)
.setInstance(compInstance)
.setContainerId(containerId);
unRecoveredInstances.put(containerId, compInstance);
component.handle(event);
  }
{code}
- Similary, ContainerRecoveryFailedTransition can be removed, just call the 
existing reInsertPendingInstance API instead (rename it to 
insertPendingInstance)
- add componentInstanceId to the log like below 
{code}
  LOG.info("{}, Wait on container {} expired", 
entry.getValue().getCompInstanceId, entry.getKey());
{code}
- Looks like it only adds to the pending instance list when recover timeouts, 
do we need to re-request the container ? 
{code}
if (unRecoveredInstances.size() > 0) {
  executorService.schedule(() -> {
// after containerWaitMs, all the containers that haven't be recovered
// by RM will released. The corresponding Component Instances will be
// adding to the pending queues of their respective component.
Iterator> iterator =
unRecoveredInstances.entrySet().iterator();
while (iterator.hasNext()) {
  Map.Entry entry = iterator.next();
  iterator.remove();
  LOG.info("{} Wait on container {} expired", 
entry.getValue().getCompInstanceId(), entry.getKey());
  String componentName = entry.getValue().getCompName();
  ComponentEvent event = new ComponentEvent(
  componentName, CONTAINER_RECOVERY_FAILED)
  .setInstance(entry.getValue())
  .setContainerId(entry.getKey());
  componentsByName.get(componentName).handle(event);
  amRMClient.releaseAssignedContainer(entry.getKey());
}
  }, containerWaitMs, TimeUnit.MILLISECONDS);
{code} 
- Clear the unRecoveredInstances in the end of this executor thread. it'll 
remain in memory forever otherwise.
- I think there may be a race condition where additional pending instance got 
added. 
1. In the executor thread loop, get the unRecovered instance object (say, 
instance1)
2. In onContainersReceivedFromPreviousAttempts: remove unrecovered instance1 
from the map
3. In onContainersReceivedFromPreviousAttempts: send recover event
4. In Component: to handle the recover event, remove from pending instance list
5. But then Executor thread add (instance1) back to the pending list 
We probably need to synchronize on the entire unRecoveredInstance map ? 



was (Author: jianhe):
- yarn.service.container.expiry-interval-ms -> yarn.service.am-recovery.timeout 
?
- better move ServiceScheduler.YARN_COMPONENT to YarnRegistryAttributes as well
- this logic seems a bit confusing. It sends a recover event, but adds to 
unRecoveredInstances which looks contradictoy by naming. 
All that needed is to call "component.pendingInstances.remove(instance);". How 
about just expose this method as an API ? rather than sending the event ?
{code}
  if (componentName != null) {
Component component = componentsByName.get(componentName);
ComponentInstance compInstance = component.getComponentInstance(
record.description);
ContainerId containerId = ContainerId.fromString(record.get(
YarnRegistryAttributes.YARN_ID));
ComponentEvent event = new ComponentEvent(componentName,
CONTAINER_RECOVERED)
.setInstance(compInstance)
.setContainerId(containerId);
unRecoveredInstances.put(containerId, compInstance);
component.handle(event);
  }
{code}
- Similary, ContainerRecoveryFailedTransition can be removed, just call the 
above addPendingInstance API instead
- add componentInstanceId to the log like below 
{code}
  LOG.info("{}, Wait on container {} expired", 
entry.getValue().getCompInstanceId, entry.getKey());
{code}
- Looks like it only adds to the pending instance list when re

[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart

2017-12-05 Thread Chandni Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279381#comment-16279381
 ] 

Chandni Singh edited comment on YARN-7565 at 12/6/17 12:26 AM:
---

Patch 2
- Addressed [~jianhe]'s comments
- Instead of checking repeatedly if containers are recovered, made the change 
in ServiceScheduler to check just once if the unRecovered containers are still 
there. If they haven't been recovered by then, then the container is released 
and component instance is added to pending queue.
- Added another test


was (Author: csingh):
Patch 2
- Addressed [~jianhe]'s comments
- Instead of checking repeatedly if containers are recovered, made the change 
in ServiceScheduler to check just once if the unRecovered containers are 
available.
- Added another test

> Yarn service pre-maturely releases the container after AM restart 
> --
>
> Key: YARN-7565
> URL: https://issues.apache.org/jira/browse/YARN-7565
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
> Fix For: yarn-native-services
>
> Attachments: YARN-7565.001.patch, YARN-7565.002.patch
>
>
> With YARN-6168, recovered containers can be reported to AM in response to the 
> AM heartbeat. 
> Currently, the Service Master will release the containers, that are not 
> reported in the AM registration response, immediately.
> Instead, the master can wait for a configured amount of time for the 
> containers to be recovered by RM. These containers are sent to AM in the 
> heartbeat response. Once a container is not reported in the configured 
> interval, it can be released by the master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org