Mike Dusenberry created SYSTEMML-1563:
-----------------------------------------

             Summary: Add a distributed synchronous SGD MNIST LeNet example
                 Key: SYSTEMML-1563
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1563
             Project: SystemML
          Issue Type: Sub-task
            Reporter: Mike Dusenberry
            Assignee: Mike Dusenberry


This aims to add a distributed synchronous SGD MNIST LeNet example.  In 
distributed synchronous SGD, multiple mini-batches are run forward & backward 
in parallel, and the resulting gradients are aggregated by addition before 
the model parameters are updated.  This is mathematically equivalent to simply 
using a large mini-batch size, i.e. {{new_mini_batch_size = mini_batch_size * 
number_of_parallel_mini_batches}}.  The benefit is that distributed synchronous 
SGD can make use of multiple devices, i.e. multiple GPUs or multiple CPU 
machines, and thus can speed up training time.  More specifically, using an 
effectively larger mini-batch size can yield a more stable gradient in 
expectation, and a larger number of epochs can be run in the same amount of 
time, both of which lead to faster convergence.  Alternatives include various 
forms of distributed *asynchronous* SGD, such as Downpour, Hogwild, etc.  
However, a recent paper \[1] from Google Brain / OpenAI presents evidence that 
distributed synchronous SGD can converge faster than these asynchronous 
approaches, particularly when it is extended with the notion of "backup 
workers" as described in the paper.
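
As a rough illustration of the aggregation described above (plain NumPy rather 
than SystemML DML; the linear model and all function names are invented for the 
sketch), one synchronous step over several parallel mini-batches produces 
exactly the same update as one step over the concatenated large mini-batch:

{code:python}
import numpy as np

def grad_sum(W, X, y):
    # Sum (not mean) of per-example gradients of 0.5 * (x.W - y)^2 for a linear model.
    return X.T @ (X @ W - y)

def sync_sgd_step(W, batches, lr):
    # Each (X_i, y_i) pair would be processed by its own worker/GPU in parallel;
    # the loop here is sequential just to show the aggregation math.
    total = sum(grad_sum(W, X_i, y_i) for X_i, y_i in batches)
    n = sum(X_i.shape[0] for X_i, _ in batches)
    return W - lr * (total / n)

rng = np.random.default_rng(0)
W0 = np.zeros((5, 1))
batches = [(rng.normal(size=(32, 5)), rng.normal(size=(32, 1))) for _ in range(4)]
W1 = sync_sgd_step(W0, batches, lr=0.01)

# Equivalence check: one step on the concatenated "large" mini-batch of size 4 * 32.
X_all = np.vstack([X for X, _ in batches])
y_all = np.vstack([y for _, y in batches])
W1_big = W0 - 0.01 * grad_sum(W0, X_all, y_all) / X_all.shape[0]
assert np.allclose(W1, W1_big)
{code}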

We will first aim for distributed synchronous SGD with no backup workers, and 
then extend this to include backup workers.  The MNIST LeNet model will simply 
serve as an example, and this same approach can be extended to more recent 
models, such as ResNets.
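
For the later extension, a minimal sketch of the backup-worker idea from \[1] 
(again NumPy-only, with simulated completion times and fake gradients; 
{{simulate_worker}} and {{backup_worker_step}} are hypothetical names, not 
SystemML APIs): N + b workers are launched per step, but only the first N 
gradients to arrive are aggregated, so stragglers do not stall the update.

{code:python}
import numpy as np

def simulate_worker(rng, dim=5):
    # Hypothetical worker: returns (finish_time, gradient). Both are faked with
    # random draws purely to illustrate the selection logic, not real training.
    return rng.exponential(1.0), rng.normal(size=(dim, 1))

def backup_worker_step(W, lr, n_needed=4, n_backup=2, seed=0):
    rng = np.random.default_rng(seed)
    # Launch n_needed + n_backup workers on the same parameters W ...
    results = [simulate_worker(rng) for _ in range(n_needed + n_backup)]
    # ... but update with only the first n_needed gradients to arrive,
    # dropping the stragglers.
    results.sort(key=lambda r: r[0])
    fastest = [g for _, g in results[:n_needed]]
    return W - lr * (sum(fastest) / n_needed)

W = backup_worker_step(np.zeros((5, 1)), lr=0.01)
{code}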

\[1]: https://arxiv.org/abs/1604.00981



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
