[ https://issues.apache.org/jira/browse/SINGA-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
wangwei resolved SINGA-12.
--------------------------
Resolution: Fixed
> Support Checkpoint and Restore
> ------------------------------
>
> Key: SINGA-12
> URL: https://issues.apache.org/jira/browse/SINGA-12
> Project: Singa
> Issue Type: New Feature
> Reporter: Sheng Wang
> Assignee: Sheng Wang
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> With checkpoint support, we can provide the following features:
> 1. Failure Recovery: when a task fails during training, we can
> recover it from the latest checkpoint;
> 2. Continuous Training: when the user inspects the trained model and finds
> that more steps are needed, they can continue the training;
> 3. Parameter Reuse: from a previously trained model, we can create a new
> model by adding new layers on top of it, and reuse the trained parameters
> during training.
> Checkpoints should be taken on the server side every few steps. In
> addition, a final checkpoint is made when the task finishes.
> During restore, the servers/workers are first set up as normal, and
> the parameters are then loaded from the checkpoint file.
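
For illustration only, the checkpoint/restore flow described above could be
sketched roughly as follows. This is plain Python, not SINGA code: the
function names (save_checkpoint, load_checkpoint, train), the pickle file
format, and the "+1" parameter update are all assumptions made for the
example. It only shows the ordering: set up as normal, load parameters from
the checkpoint file if one exists, checkpoint every few steps, and write a
final checkpoint when the task finishes.

```python
import os
import pickle


def save_checkpoint(path, step, params):
    """Write the step counter and parameters (hypothetical format)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved dict, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, checkpoint_every, init_params):
    # 1. Set up as normal, with freshly initialised parameters.
    params = dict(init_params)
    start = 0

    # 2. Restore: overwrite parameters with those found in the checkpoint.
    ckpt = load_checkpoint(path)
    if ckpt is not None:
        params.update(ckpt["params"])
        start = ckpt["step"]

    # 3. Train, checkpointing every few steps.
    for step in range(start + 1, total_steps + 1):
        for name in params:
            params[name] += 1.0  # stand-in for a real parameter update
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, params)

    # 4. Final checkpoint when the task is finished.
    save_checkpoint(path, max(start, total_steps), params)
    return params
```

Calling train() again with the same path after a crash resumes from the
latest checkpoint (failure recovery); calling it again with a larger
total_steps continues the earlier run (continuous training).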