[ 
https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149654#comment-15149654
 ] 

Jun Gong commented on YARN-3998:
--------------------------------

Sorry for the late reply; I was on holiday.

Thanks [~vinodkv] and [~vvasudev] for the suggestions and review!

Some additional thoughts beyond [~vvasudev]'s comments:
{quote}
Unification with AM restart policies
{quote}
I agree with [~vvasudev]. The current AM restart policy retries across different 
nodes, while this feature retries on the local node. When the RM launches an AM, 
it could specify a local retry policy for it.

{quote}
Treat relaunch in a first-class manner
{quote}
Glad to see relaunch treated in a first-class manner. I will update the patch.

{quote}
The following isn’t fool-proof and won’t work for all apps, can we just persist 
and read the selected log-dir from the state-store?
ContainerLaunch.handleContainerExitWithFailure() needs to be handled differently 
during container-relaunches.
The same can be done for the work-dir.
All of these are related. If we store the log dir and work dir in the state 
store, we can address all 3 of these. 
{quote}
Yes, it will be better to store the log dir and work dir if we aim to make it 
more accurate. I was trying to keep the changes for this feature minimal.

{quote}
In fact, if we end up changing the work-dir during relaunch due to a bad-dir, 
that may result in a breakage for the app. Apps may be reading from / writing 
into the work-dir and changing it during relaunch may invalidate application's 
assumptions. Should we just fail the container completely and let the AM deal 
with it?
{quote}
My thought is that if the user specifies a retry policy for a container, the user 
should make sure that the container can deal with this situation.

{quote}
Instead of removing a line and setting the limit to 10*1000, take the last 'n' 
characters in the string where 'n' is a config setting.
{quote}
Removing the last n characters might make the diagnostics inconsistent. Suppose 
the diagnostics are “The exception is XXXX” and XXXX has n characters; the 
diagnostics would become “The exception is”. Removing the first or last n lines 
has a similar problem. How about removing previous attempts' error information 
and keeping only the latest attempt's information? 
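
A minimal sketch of that last suggestion, assuming diagnostics are collected as 
whole messages per launch attempt. The class RelaunchDiagnostics and its methods 
are hypothetical names used only for illustration; they are not part of the 
patch or of the NodeManager API.

```java
/** Hypothetical helper: keeps diagnostics for the latest launch attempt only. */
final class RelaunchDiagnostics {
  private int attempt = 0;
  private final StringBuilder latest = new StringBuilder();

  /** Called when a launch or relaunch starts; drops the previous attempt's messages. */
  void startAttempt() {
    attempt++;
    latest.setLength(0);
  }

  /** Appends one complete diagnostic message for the current attempt. */
  void add(String message) {
    latest.append(message).append('\n');
  }

  /** Returns the latest attempt's diagnostics; no message is cut mid-sentence. */
  String report() {
    return "[attempt " + attempt + "]\n" + latest;
  }
}
```

Because older attempts are dropped as a unit, the size stays bounded by a single 
attempt's output, and a message such as “The exception is XXXX” is either kept 
whole or discarded whole.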

Glad to see more discussion about the feature.

> Add retry-times to let NM re-launch container when it fails to run
> ------------------------------------------------------------------
>
>                 Key: YARN-3998
>                 URL: https://issues.apache.org/jira/browse/YARN-3998
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3998.01.patch, YARN-3998.02.patch, 
> YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, YARN-3998.06.patch
>
>
> I'd like to add a field (retry-times) to ContainerLaunchContext. When the AM 
> launches containers, it could specify the value. The NM will then re-launch the 
> container 'retry-times' times when it fails to run (e.g. exit code is not 0). 
> This will save a lot of time: it avoids container localization, the RM does not 
> need to re-schedule the container, and local files in the container's working 
> directory will be left for re-use. (If the container has downloaded some big 
> files, it does not need to re-download them when running again.) 
> We find it useful in systems like Storm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
