Byron Hsu created SUBMARINE-857:
-----------------------------------
Summary: [Umbrella] Support model management SDK in distributed
scenarios
Key: SUBMARINE-857
URL: https://issues.apache.org/jira/browse/SUBMARINE-857
Project: Apache Submarine
Issue Type: Task
Reporter: Byron Hsu
Submarine is a platform designed for distributed training, so its model
management SDK should be easy to use in distributed scenarios.
In a typical distributed experiment, several workers train together.
Our model management toolkit will support:
1. Workers in the same experiment will automatically direct their logs to
the same group in MLflow, so users can monitor metrics from all workers in one
graph.
2. When saving models, users do not need to store every worker's copy, because
some copies are replicated or redundant. When users call save_model, the
toolkit will apply the most efficient saving strategy under the hood, which
minimizes both storage space and saving time.
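The two behaviors above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Submarine's implementation: it assumes each worker learns its role from the TensorFlow-style `TF_CONFIG` environment variable, and that the experiment ID arrives in a hypothetical `SUBMARINE_EXPERIMENT_ID` variable. The helpers `resolve_worker_context`, `log_group`, `run_name`, and `should_save_model` are likewise hypothetical. The save rule assumes synchronous data-parallel training, where every replica holds identical weights, so a single process (the chief, or worker 0 when the cluster has no chief) writes the model.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerContext:
    experiment_id: str  # shared by every worker in the same experiment
    task_type: str      # e.g. "chief", "worker", "ps"
    task_index: int
    has_chief: bool     # whether the cluster spec defines a dedicated chief


def resolve_worker_context(env: Optional[Mapping[str, str]] = None) -> WorkerContext:
    """Read this process's role from TF_CONFIG (defaults describe a single local process)."""
    env = os.environ if env is None else env
    # HYPOTHETICAL variable name for the experiment ID; not a documented Submarine setting.
    experiment_id = env.get("SUBMARINE_EXPERIMENT_ID", "local")
    tf_config = json.loads(env.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    cluster = tf_config.get("cluster", {})
    return WorkerContext(
        experiment_id=experiment_id,
        task_type=task.get("type", "worker"),
        task_index=int(task.get("index", 0)),
        has_chief="chief" in cluster,
    )


def log_group(ctx: WorkerContext) -> str:
    # Every worker of one experiment computes the same key, so their logs share a group.
    return f"experiment-{ctx.experiment_id}"


def run_name(ctx: WorkerContext) -> str:
    # Distinguishes individual workers' curves inside the shared group.
    return f"{ctx.task_type}-{ctx.task_index}"


def should_save_model(ctx: WorkerContext) -> bool:
    """ASSUMPTION: synchronous data parallelism, so all replicas hold identical weights."""
    if ctx.has_chief:
        return ctx.task_type == "chief"
    # Without a chief, worker 0 acts as the chief; parameter servers never save here.
    return ctx.task_type == "worker" and ctx.task_index == 0
```

With MLflow, one option would be to pass `log_group(ctx)` to `mlflow.set_experiment` and `run_name(ctx)` to `mlflow.start_run(run_name=...)`, giving each worker its own run inside a shared experiment. Other parallelism schemes, such as model parallelism where each worker holds a different shard, would need a different rule (for example, saving each shard exactly once).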
--
This message was sent by Atlassian Jira
(v8.3.4#803005)