[
https://issues.apache.org/jira/browse/SUBMARINE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
cdmikechen updated SUBMARINE-857:
---------------------------------
Target Version: 0.6.0 (was: 0.9.0)
> [Umbrella] Support model management SDK in distributed scenarios
> ----------------------------------------------------------------
>
> Key: SUBMARINE-857
> URL: https://issues.apache.org/jira/browse/SUBMARINE-857
> Project: Apache Submarine
> Issue Type: Task
> Reporter: Byron Hsu
> Assignee: Byron Hsu
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Submarine is a platform designed for distributed training, so its model
> management SDK should be easy to use in distributed scenarios.
> In a typical distributed experiment, several workers train together.
> Our model management toolkit will support:
> 1. Workers in the same experiment automatically direct their logs to the
> same group in MLflow, so users can monitor multiple workers' information in
> one graph.
> 2. When saving models, users do not need to store every worker's copy,
> because some copies are replicated or redundant. When users call save_model,
> the toolkit applies the most efficient saving strategy under the hood to
> minimize the space and time required.
> The API design doc can be viewed here:
> [https://hackmd.io/I6frSeZIQDaKQYK4nGCR5w?both]
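
To make the two goals above concrete, here is a minimal sketch of how they might look from a worker's point of view. It is not the Submarine SDK and does not follow the linked design doc: only the name save_model comes from the issue text. The environment variables EXPERIMENT_ID and WORKER_RANK, the helpers worker_identity and metric_key, and the "rank 0 writes replicated state" rule are all assumptions made for illustration.

```python
import os
import pickle
from pathlib import Path


def worker_identity(env=os.environ):
    """Return (experiment_id, rank) for the current worker.

    EXPERIMENT_ID and WORKER_RANK are hypothetical variable names,
    assumed to be injected by whatever launches the workers.
    """
    experiment_id = env.get("EXPERIMENT_ID", "local-experiment")
    rank = int(env.get("WORKER_RANK", "0"))
    return experiment_id, rank


def metric_key(name, rank):
    """Build a per-worker metric name so that all workers can log into
    one shared group and still be told apart on a single graph."""
    return f"{name}-worker-{rank}"


def save_model(state, directory, rank, replicated=True):
    """Persist model state while avoiding redundant copies.

    Assumed strategy: if every worker holds the same (replicated) state,
    only rank 0 writes it; otherwise each worker writes its own shard.
    Returns the written path, or None when this worker skips the write.
    """
    if replicated and rank != 0:
        return None
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    suffix = "" if replicated else f"-rank{rank}"
    path = directory / f"model{suffix}.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path
```

With MLflow, the shared group could plausibly be obtained by having every worker pass the same experiment_id to mlflow.set_experiment and log under metric_key(...) with mlflow.log_metric; whether Submarine does it this way is not stated in the issue.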