[ 
https://issues.apache.org/jira/browse/SLIDER-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gour Saha updated SLIDER-479:
-----------------------------
    Description: 
A container can continue to run even after a slider stop command has been 
issued. One such scenario is when the NM of a non-Slider-AM node is lost (for 
some intermittent reason) and a slider stop command is then issued. YARN will 
not be able to clean up the stranded agent (and the application processes). 
Even if the NM is brought back up later, YARN does not kill these 
containers.

In a large cluster with several applications deployed/managed by slider there 
could easily be numerous such stranded containers.

The Slider client could expose a "stop-all" command, or perhaps a 
"stop --clean" option (or anything appropriate for this task), to do the 
cleanup. It could bring up the Slider-AM in a clean mode (say), which would not 
start any application but would simply register with ZK and wait for these 
stranded agents to heartbeat into it. Each of these agents should then receive 
a terminate command from the AM, do the necessary cleanup, and shut down.
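
A minimal sketch of how the proposed clean mode might treat agent heartbeats. 
This is illustration only, not Slider code: the class name CleanModeAM, the 
method on_heartbeat, and the "TERMINATE" response shape are all hypothetical, 
and ZK registration is left out.

```python
from dataclasses import dataclass, field


@dataclass
class CleanModeAM:
    """Hypothetical clean-mode AM: starts no application, only reaps agents."""

    # Container ids of agents that have been told to terminate.
    terminated: set = field(default_factory=set)

    def on_heartbeat(self, container_id: str) -> dict:
        # In clean mode no application is running, so any agent that
        # heartbeats in is assumed to be stranded and is told to shut down.
        self.terminated.add(container_id)
        return {"command": "TERMINATE", "container_id": container_id}


am = CleanModeAM()
print(am.on_heartbeat("container_01"))
```

Tracking container ids in a set means a repeated heartbeat from the same agent 
is counted once, which keeps any later cleanup count accurate.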

This new command can be issued only after an application has been stopped. When 
invoked while the application is running, this command should be ignored or 
fail, providing relevant information. The command can also provide a summary of 
how many stranded containers it cleaned up.
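
The precondition and the summary could look roughly like this. The sketch 
picks the "fail" behaviour rather than "ignore"; the function stop_clean, the 
exception ApplicationRunningError, and the "STOPPED" state string are 
hypothetical names, not part of the Slider client.

```python
class ApplicationRunningError(Exception):
    """Raised when cleanup is requested for an application that is not stopped."""


def stop_clean(app_name: str, app_state: str, reaped: set) -> str:
    # Refuse to run against a live application and say why.
    if app_state != "STOPPED":
        raise ApplicationRunningError(
            f"{app_name} is {app_state}; stop it before running cleanup"
        )
    # Report how many stranded containers were cleaned up.
    return f"{app_name}: cleaned up {len(reaped)} stranded container(s)"


print(stop_clean("hbase1", "STOPPED", {"container_01", "container_02"}))
```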


  was:
A container can continue to run even after a slider stop command has been 
issued. One such scenario is when the NM of a non-Slider-AM node is lost and 
a slider stop command was issued before YARN could clean up the stranded agent 
(and the application processes). In such a scenario, even if the NM is brought 
back up, it will not kill these containers.

In a large cluster with several applications deployed/managed by slider there 
could easily be numerous such stranded containers.

Slider client could expose a "stop-all" command or maybe an option "stop 
--clean" (or anything appropriate for this task) to do the cleanup. It can 
bring up the Slider-AM in clean mode (say) which will not start any application 
but will simply register to ZK and wait for agents to heart-beat into it. Each 
one of these agents will receive the terminate command from the AM and will do 
necessary cleanup and shutdown.

This new command can be issued only after an application has been stopped. When 
invoked while the application is running this command should fail providing 
relevant information. This command can also provide a summary of how many 
stranded containers it cleaned up.



> Provide a slider command to kill all stranded containers continuing to run 
> post stop command
> --------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-479
>                 URL: https://issues.apache.org/jira/browse/SLIDER-479
>             Project: Slider
>          Issue Type: Bug
>            Reporter: Gour Saha
>             Fix For: Slider 2.0.0
>
>
> A container can continue to run even after a slider stop command has been 
> issued. One such scenario is when the NM of a non-Slider-AM node is lost (for 
> some intermittent reason) and a slider stop command is then issued. YARN will 
> not be able to clean up the stranded agent (and the application processes). 
> Even if the NM is brought back up later, YARN does not kill these 
> containers.
> In a large cluster with several applications deployed/managed by slider there 
> could easily be numerous such stranded containers.
> The Slider client could expose a "stop-all" command, or perhaps a 
> "stop --clean" option (or anything appropriate for this task), to do the 
> cleanup. It could bring up the Slider-AM in a clean mode (say), which would 
> not start any application but would simply register with ZK and wait for 
> these stranded agents to heartbeat into it. Each of these agents should then 
> receive a terminate command from the AM, do the necessary cleanup, and shut 
> down.
> This new command can be issued only after an application has been stopped. 
> When invoked while the application is running, this command should be ignored 
> or fail, providing relevant information. The command can also provide a 
> summary of how many stranded containers it cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
