[ 
https://issues.apache.org/jira/browse/YARN-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xun Liu updated YARN-8876:
--------------------------
    Description: 
h1. Job monitor long-running service of submarine

After training completes, the monitoring program needs to shut down the PS 
(parameter server) service automatically. Other deep learning frameworks may 
also need custom processing when their tasks enter different states.

Submarine needs to provide a long-running resident service that monitors each 
job.

This monitoring service can handle training tasks differently depending on 
the deep learning framework type.

For example, when TensorFlow performs distributed training, the PS service 
does not stop automatically once training completes. The monitoring service 
must then stop the PS explicitly.
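
The behaviour described above could be sketched roughly as follows. This is 
only a minimal illustration in Python, not Submarine or YARN code: the names 
JobState, Job, JobMonitor and stop_tensorflow_ps are hypothetical, and the 
point where a real service would ask YARN to stop the PS containers is left 
as a comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class JobState(Enum):
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"


@dataclass
class Job:
    job_id: str
    framework: str             # e.g. "tensorflow"
    state: JobState = JobState.RUNNING
    ps_running: bool = True    # whether the PS service is still up


# A handler inspects a job and reacts to its current state.
Handler = Callable[[Job], None]


def stop_tensorflow_ps(job: Job) -> None:
    """Hypothetical handler: stop the PS once training has finished."""
    if job.state is JobState.FINISHED and job.ps_running:
        # Assumption: a real service would ask YARN to stop the PS here.
        job.ps_running = False


class JobMonitor:
    """Resident monitor that dispatches each job to a per-framework handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._jobs: List[Job] = []

    def register_handler(self, framework: str, handler: Handler) -> None:
        self._handlers[framework] = handler

    def watch(self, job: Job) -> None:
        self._jobs.append(job)

    def poll_once(self) -> None:
        """One monitoring pass; a long-running service would loop over this."""
        for job in self._jobs:
            handler = self._handlers.get(job.framework)
            if handler is not None:
                handler(job)
```

Keeping the handlers in a per-framework table is one way to let each 
framework define its own processing for different task states without 
changing the monitor loop itself.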

  was:
h1. Job monitor service of submarine

After training completes, the monitoring program needs to shut down the PS 
(parameter server) service automatically. Other deep learning frameworks may 
also need custom processing when their tasks enter different states.

Submarine needs to provide a long-running resident service that monitors each 
job.

This monitoring service can handle training tasks differently depending on 
the deep learning framework type.

For example, when TensorFlow performs distributed training, the PS service 
does not stop automatically once training completes. The monitoring service 
must then stop the PS explicitly.


> [Submarine] Job monitor long-running service of submarine
> ---------------------------------------------------------
>
>                 Key: YARN-8876
>                 URL: https://issues.apache.org/jira/browse/YARN-8876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Xun Liu
>            Assignee: Xun Liu
>            Priority: Major
>
> h1. Job monitor long-running service of submarine
> After training completes, the monitoring program needs to shut down the PS 
> (parameter server) service automatically. Other deep learning frameworks may 
> also need custom processing when their tasks enter different states.
> Submarine needs to provide a long-running resident service that monitors 
> each job.
> This monitoring service can handle training tasks differently depending on 
> the deep learning framework type.
> For example, when TensorFlow performs distributed training, the PS service 
> does not stop automatically once training completes. The monitoring service 
> must then stop the PS explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
