Mike Dusenberry created SYSTEMML-1760:
-----------------------------------------

             Summary: Improve engine robustness of distributed SGD training
                 Key: SYSTEMML-1760
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
             Project: SystemML
          Issue Type: Improvement
            Reporter: Mike Dusenberry
            Assignee: Fei Hu


Currently, we have a mathematical framework in place for training with 
distributed SGD in a [distributed MNIST LeNet example | 
https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml].
This task aims to run this example at scale to determine (1) the current 
behavior of the engine (i.e., does the optimizer actually run this in a 
distributed fashion?), and (2) ways to improve the robustness and performance 
for this scenario.  The 
distributed SGD framework from this example has already been ported into 
Caffe2DML, and thus improvements made for this task will directly benefit our 
efforts towards distributed training of Caffe models (and Keras in the future).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
