[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134366#comment-15134366
 ] 

Varun Vasudev edited comment on YARN-3998 at 2/5/16 3:59 PM:
-------------------------------------------------------------

Thanks for your comments, Vinod. Overall, I mostly agree with your thoughts - it 
might make sense to file an umbrella JIRA to add first-class container retry 
support and have this become a sub-task.

With regards to your specific comments -

bq. Unification with AM restart policies 
This should be taken up as a different JIRA - there are significant differences 
between AM restarts and container restarts - the most important one being that 
AM restarts can occur on different nodes (which is probably the desired 
behavior) but container relaunches by definition have to be on the same node. 
We might be able to re-use the notion of retrying on specific error codes, but 
it probably requires some more discussion.
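To make the "retry on specific error codes" idea concrete, here is a minimal illustrative sketch of such a policy check. The class and method names (ContainerRetryPolicySketch, shouldRetry) are hypothetical and not part of the actual YARN API - this just shows the decision logic being discussed.

```java
import java.util.Set;

// Illustrative sketch only: retry a container when its exit code is in a
// configured set of retryable codes and the retry budget is not exhausted.
class ContainerRetryPolicySketch {
  private final Set<Integer> retryErrorCodes;
  private final int maxRetries;
  private int retriesDone = 0;

  ContainerRetryPolicySketch(Set<Integer> retryErrorCodes, int maxRetries) {
    this.retryErrorCodes = retryErrorCodes;
    this.maxRetries = maxRetries;
  }

  // Exit code 0 means success; otherwise retry only for configured codes
  // while the retry count has not been used up.
  boolean shouldRetry(int exitCode) {
    if (exitCode == 0 || retriesDone >= maxRetries
        || !retryErrorCodes.contains(exitCode)) {
      return false;
    }
    retriesDone++;
    return true;
  }
}
```

The same shape could plausibly be shared between AM restarts and container relaunches, which is the re-use question raised above.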
 
bq. Treat relaunch in a first-class manner
Makes sense - [~hex108] - can you incorporate these into your next patch?

bq. The relaunch feature needs to work across NM restarts, so we should save 
the retry-context and policy per container into the state-store and reload it 
for continue relaunching after NM restart.

The container retry policy details are already stored in the state-store as 
part of the ContainerLaunchContext.
    
bq. We should also handle restarting of any containers that may have crashed 
during the NM reboot.
This should be handled as a separate JIRA and not as part of this one - 
especially because we won't know why the container crashed if it crashed 
during the NM reboot.
    

bq. The following isn’t fool-proof and won’t work for all apps, can we just 
persist and read the selected log-dir from the state-store?
bq. ContainerLaunch.handleContainerExitWithFailure() needs to handled 
differently during container-relaunches.
bq. The same can be done for the work-dir.    

All of these are related. If we store the log dir and work dir in the state 
store, we can address all 3 of these. [~hex108] - what do you think?
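As a toy illustration of the idea, the sketch below records the dirs chosen at first launch so a relaunch (including one after NM restart) can reuse them instead of re-selecting. This stands in for the real NM state-store; the class and method names here are made up for illustration, not the NMStateStoreService API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the NM state-store: containerId -> the
// work dir and log dir selected at first launch.
class DirStateSketch {
  private final Map<String, String[]> store = new HashMap<>();

  // Persist the dirs picked when the container first launches.
  void storeDirs(String containerId, String workDir, String logDir) {
    store.put(containerId, new String[] {workDir, logDir});
  }

  // On relaunch, load the recorded dirs so the container sees the
  // same paths it had before, rather than freshly selected ones.
  String[] loadDirs(String containerId) {
    return store.get(containerId);
  }
}
```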

bq.    In fact, if we end up changing the work-dir during relaunch due to a 
bad-dir, that may result in a breakage for the app. Apps may be reading from / 
writing into the work-dir and changing it during relaunch may invalidate 
application's assumptions. Should we just fail the container completely and let 
the AM deal with it?

I suspect we'll have to add an additional flag which the AMs will have to set 
to decide the behaviour. There are many valid cases where containers can store 
state on mounted volumes and don't rely on the YARN local dirs. On the other 
hand, like you pointed out in some cases, we might want to abandon the 
relaunch. I'm not sure if this should be done as part of this JIRA or another 
one.

[~hex108] with regards to your latest patch -

1) Can you change RETRY_ON_SPECIFIC_ERROR_CODE to RETRY_ON_SPECIFIC_ERROR_CODES?

2)
{code}
+    if (diagnostics.length() > DIAGNOSTICS_MAX_SIZE) {
+      // delete first line
+      int i = diagnostics.indexOf("\n");
+      if (i != -1) {
+        diagnostics.delete(0, i + 1);
+      }
+    }
{code}
Instead of this - can you please change the function to
a) Only trim the diagnostics string size in case of a relaunch
b) Instead of removing a line and setting the limit to 10*1000, take the last 
'n' characters in the string where 'n' is a config setting.
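A minimal sketch of suggestion (b), assuming maxSize comes from a config setting rather than the hard-coded 10*1000 (the method name trimToLastN is illustrative, not a proposal for the actual patch):

```java
// Illustrative sketch: keep only the last maxSize characters of the
// diagnostics buffer, instead of dropping whole lines from the front.
class DiagnosticsTrimSketch {
  static void trimToLastN(StringBuilder diagnostics, int maxSize) {
    // Delete everything before the final maxSize characters, if any.
    int excess = diagnostics.length() - maxSize;
    if (excess > 0) {
      diagnostics.delete(0, excess);
    }
  }
}
```

This keeps the most recent diagnostics regardless of where line breaks fall, which the current delete-first-line approach does not guarantee.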

3)
{code}
+    } catch (IOException e) {
+      LOG.error("Failed to delete " + relativePath, e);
+    }
{code}
and 
{code}
+    } catch (IOException e) {
+      LOG.error("Failed to delete container script file and token file", e);
+    }
+
{code}
Change LOG.error to LOG.warn

4) The current version of the patch doesn't work with LCE because the NM can't 
clean up the launch container script and the tokens (they're owned by the user 
who submitted the job/nobody). Fixing this requires a change in the 
container-executor binary, which will involve passing a flag to the 
container-executor binary to relaunch a container. [~vinodkv] - what do you 
think we should do here - hold off committing this patch until we have support 
for relaunch in container-executor, or commit it, document the fact that it'll 
only work with DefaultContainerExecutor, and file a follow-up JIRA?



> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM 
> launches containers, it could specify the value. The NM will then re-launch the 
> container up to 'retry-times' times when it fails to run (e.g. exit code is not 0). 
> This saves a lot of time: it avoids container localization, the RM does not 
> need to re-schedule the container, and local files in the container's working 
> directory are left for re-use. (If a container has downloaded some big 
> files, it does not need to re-download them when running again.)
> We find this useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
