[ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105497#comment-16105497 ]
Fei Hu edited comment on SYSTEMML-1760 at 7/28/17 7:02 PM:
-----------------------------------------------------------

The following table shows the history of performance improvements. After fixing SYSTEMML-1762 and SYSTEMML-1774, the distributed MNIST_LeNet model could be trained in parallel with the HYBRID_SPARK and REMOTE_SPARK parfor modes. Changing the default parfor result merge to REMOTE_SPARK reduced the runtime substantially, which suggests that result merge may be a performance bottleneck (a minimal parfor configuration sketch is appended at the end of this message).

!Runtime_Table.png!


> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>         Attachments: Runtime_Table.png
>
>
> Currently, we have a mathematical framework in place for training with
> distributed SGD in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml].
> This task aims to push this to scale in order to determine (1) the current
> behavior of the engine (i.e., does the optimizer actually run this in a
> distributed fashion?), and (2) ways to improve the robustness and performance
> for this scenario. The distributed SGD framework from this example has
> already been ported into Caffe2DML, and thus improvements made for this task
> will directly benefit our efforts towards distributed training of Caffe
> models (and Keras in the future).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
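
As a reference for the result-merge change discussed in the comment above, here is a minimal DML sketch showing how the parfor execution mode and result merge strategy can be set explicitly to REMOTE_SPARK. The data, variable names, and toy gradient computation are illustrative only and are not taken from mnist_lenet_distrib_sgd.dml; the parameter names (mode, resultmerge, opt) follow SystemML's documented parfor syntax.

{code}
# Illustrative sketch (not from mnist_lenet_distrib_sgd.dml):
# run the worker loop as a REMOTE_SPARK parfor and merge the per-worker
# results with the REMOTE_SPARK result merge strategy.
N = 1024                              # number of examples
D = 784                               # number of features
P = 4                                 # number of parallel workers
batch_size = as.integer(N / P)

X = rand(rows=N, cols=D)
y = rand(rows=N, cols=1)
W = matrix(0, rows=D, cols=1)
dW_agg = matrix(0, rows=D, cols=P)    # one gradient column per worker (result variable)

parfor (p in 1:P, mode=REMOTE_SPARK, resultmerge=REMOTE_SPARK, opt=CONSTRAINED) {
  beg = (p - 1) * batch_size + 1
  end = p * batch_size
  X_p = X[beg:end, ]
  y_p = y[beg:end, ]
  # toy linear-model gradient for this worker's batch
  dW_agg[, p] = t(X_p) %*% ((X_p %*% W) - y_p) / batch_size
}

# average the per-worker gradients and take one SGD step
W = W - 0.1 * (rowSums(dW_agg) / P)
{code}

opt=CONSTRAINED is used here so that the parfor optimizer is expected to keep the user-specified mode and resultmerge rather than overriding them, which is the simplest way to reproduce the REMOTE_SPARK result-merge behavior described above.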