[ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fei Hu updated SYSTEMML-1760:
-----------------------------
    Attachment: Runtime_Table.png

> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>         Attachments: Runtime_Table.png
>
>
> Currently, we have a mathematical framework in place for training with distributed SGD in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml]. This task aims to push this at scale to determine (1) the current behavior of the engine (i.e., does the optimizer actually run this in a distributed fashion?), and (2) ways to improve the robustness and performance for this scenario. The distributed SGD framework from this example has already been ported into Caffe2DML, and thus improvements made for this task will directly benefit our efforts towards distributed training of Caffe models (and Keras models in the future).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
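For context, the linked DML example expresses synchronous data-parallel mini-batch SGD as a parfor loop over per-worker gradient computations, which the SystemML optimizer may compile to a local or distributed (Spark) plan. The following is only a minimal NumPy sketch of the underlying gradient-averaging idea, not the example's actual code; the linear-regression model and all function names here are hypothetical illustrations.

```python
import numpy as np

def sgd_grad(w, X, y):
    # Gradient of mean squared error for a linear model: loss = mean((Xw - y)^2).
    return 2.0 * X.T @ (X @ w - y) / len(y)

def parallel_sgd_step(w, X, y, lr=0.05, num_workers=4):
    # Split the mini-batch across workers (in the DML example this split is a
    # parfor loop that the optimizer may execute as a distributed job),
    # compute per-worker gradients, then average them. With equal-sized
    # shards this is exactly the gradient of the full mini-batch.
    shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
    grads = [sgd_grad(w, Xs, ys) for Xs, ys in shards]
    return w - lr * np.mean(grads, axis=0)
```

Whether the engine actually parallelizes the analogous parfor loop, and at what cost, is precisely what this task's experiments (cf. the attached Runtime_Table.png) aim to measure.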