[ 
https://issues.apache.org/jira/browse/YARN-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734800#comment-14734800
 ] 

Junping Du commented on YARN-3337:
----------------------------------

I think there is one difficulty here: it looks like the RM scheduler does not 
keep info on finished containers, only on live ones (in 
SchedulerApplicationAttempt). If no dead-container info is preserved in the RM, 
the newly added API can only send a kill-container event, with no way of 
knowing whether the container actually got killed (and no way to differentiate 
a wrong container ID from the ID of a finished container). A CLI could do 
better, as it can query the running container list first, then kill the 
container and wait until it is no longer active. 
If we want exactly the same semantics as the kill-apps API, then we have to 
make the RM track info for dead containers, which sounds like overkill to me, 
as it forces the RM to track all containers for all applications (the 
complexity becomes the same as in MRv1).
Maybe a better trade-off here is: forceKillContainer() only means that a 
kill-container event is sent, not that the container has actually been killed. 
A boolean response from forceKillContainer() indicates whether we found a live 
container to kill. So we could lose the idempotent property for this API?
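
To make the proposed trade-off concrete, here is a minimal sketch of what I 
mean. It is not real YARN code: LiveContainerTracker, containerStarted(), 
processKillEvents() and isLive() are hypothetical names standing in for the 
scheduler's bookkeeping, and the kill event handling is simulated with a 
simple queue. The only point is that the tracker knows about live containers 
alone, so a false return cannot say whether the ID was wrong or the container 
had already finished.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the RM scheduler's bookkeeping; not real YARN code.
class LiveContainerTracker {
    // Only live containers are tracked; finished ones are forgotten.
    private final Set<String> live = ConcurrentHashMap.newKeySet();
    // Kill events that have been sent but not yet handled.
    private final Queue<String> pendingKills = new ArrayDeque<>();

    void containerStarted(String containerId) {
        live.add(containerId);
    }

    // Returns true if a live container was found and a kill event was queued.
    // Returns false for an unknown ID and for a finished container alike.
    synchronized boolean forceKillContainer(String containerId) {
        if (!live.contains(containerId)) {
            return false;
        }
        pendingKills.add(containerId);
        return true;
    }

    // Simulates the asynchronous handling of the queued kill events.
    synchronized void processKillEvents() {
        String id;
        while ((id = pendingKills.poll()) != null) {
            live.remove(id);
        }
    }

    boolean isLive(String containerId) {
        return live.contains(containerId);
    }
}

class ForceKillDemo {
    public static void main(String[] args) {
        LiveContainerTracker rm = new LiveContainerTracker();
        rm.containerStarted("container_01");

        // Live container: the event is sent, but the container is not dead yet.
        System.out.println(rm.forceKillContainer("container_01"));
        System.out.println(rm.isLive("container_01"));

        rm.processKillEvents();

        // A finished container and a wrong ID give the same answer.
        System.out.println(rm.forceKillContainer("container_01"));
        System.out.println(rm.forceKillContainer("no_such_id"));
    }
}
```

A client that needs confirmation would have to do what the CLI does: call 
forceKillContainer(), then poll something like isLive() until the container 
drops out of the live list.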

> Provide YARN chaos monkey
> -------------------------
>
>                 Key: YARN-3337
>                 URL: https://issues.apache.org/jira/browse/YARN-3337
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: test
>    Affects Versions: 2.7.0
>            Reporter: Steve Loughran
>
> To test failure resilience today you either need custom scripts or implement 
> Chaos Monkey-like logic in your application (SLIDER-202). 
> Killing AMs and containers on a schedule & probability is the core activity 
> here, one that could be handled by a CLI App/client lib that does this. 
> # entry point to have a startup delay before acting
> # frequency of chaos wakeup/polling
> # probability of AM failure generation (0-100)
> # probability of non-AM container kill
> # future: other operations



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
