[ https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105497#comment-16105497 ]
Fei Hu edited comment on SYSTEMML-1760 at 7/28/17 7:02 PM:
-----------------------------------------------------------

The following table shows the history of performance improvements. After fixing SYSTEMML-1762 and SYSTEMML-1774, the distributed MNIST_LeNet model could be trained in parallel with the HYBRID_SPARK and REMOTE_SPARK parfor modes. Changing the default parfor result merge to REMOTE_SPARK reduced the runtime substantially, which suggests that result merge may be a performance bottleneck (a minimal parfor configuration sketch is appended at the end of this message).

!Runtime_Table.png!


> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1760
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>            Reporter: Mike Dusenberry
>            Assignee: Fei Hu
>         Attachments: Runtime_Table.png
>
>
> Currently, we have a mathematical framework in place for training with
> distributed SGD in a [distributed MNIST LeNet example | https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml].
> This task aims to push this to scale in order to determine (1) the current
> behavior of the engine (i.e., does the optimizer actually run this in a
> distributed fashion?), and (2) ways to improve the robustness and performance
> for this scenario. The distributed SGD framework from this example has
> already been ported into Caffe2DML, and thus improvements made for this task
> will directly benefit our efforts towards distributed training of Caffe
> models (and Keras in the future).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
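
As a reference for the result-merge change discussed in the comment above, here is a minimal DML sketch showing how the parfor execution mode and result merge strategy can be set explicitly to REMOTE_SPARK. The data, variable names, and toy gradient computation are illustrative only and are not taken from mnist_lenet_distrib_sgd.dml; the parameter names (mode, resultmerge, opt) follow SystemML's documented parfor syntax.

{code}
# Illustrative sketch (not from mnist_lenet_distrib_sgd.dml):
# run the worker loop as a REMOTE_SPARK parfor and merge the per-worker
# results with the REMOTE_SPARK result merge strategy.
N = 1024                              # number of examples
D = 784                               # number of features
P = 4                                 # number of parallel workers
batch_size = as.integer(N / P)

X = rand(rows=N, cols=D)
y = rand(rows=N, cols=1)
W = matrix(0, rows=D, cols=1)
dW_agg = matrix(0, rows=D, cols=P)    # one gradient column per worker (result variable)

parfor (p in 1:P, mode=REMOTE_SPARK, resultmerge=REMOTE_SPARK, opt=CONSTRAINED) {
  beg = (p - 1) * batch_size + 1
  end = p * batch_size
  X_p = X[beg:end, ]
  y_p = y[beg:end, ]
  # toy linear-model gradient for this worker's batch
  dW_agg[, p] = t(X_p) %*% ((X_p %*% W) - y_p) / batch_size
}

# average the per-worker gradients and take one SGD step
W = W - 0.1 * (rowSums(dW_agg) / P)
{code}

opt=CONSTRAINED is used here so that the parfor optimizer is expected to keep the user-specified mode and resultmerge rather than overriding them, which is the simplest way to reproduce the REMOTE_SPARK result-merge behavior described above.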