[ https://issues.apache.org/jira/browse/SYSTEMML-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Dusenberry updated SYSTEMML-1563:
--------------------------------------
    Description: 

This aims to add a *distributed synchronous SGD* MNIST LeNet example.  In distributed synchronous SGD, multiple mini-batches are run forward & backward simultaneously, and the gradients are aggregated together by addition before the model parameters are updated.  This is mathematically equivalent to simply using a larger mini-batch size, i.e. {{new_mini_batch_size = mini_batch_size * number_of_parallel_mini_batches}}.  The benefit is that distributed synchronous SGD can make use of multiple devices, e.g. multiple GPUs or multiple CPU machines, and thus can speed up training time.  More specifically, using an effectively larger mini-batch size can yield a more stable gradient in expectation, and a larger number of epochs can be run in the same amount of time, both of which lead to faster convergence.

Alternatives include various forms of distributed _asynchronous_ SGD, such as Downpour, Hogwild, etc.  However, a recent paper \[1] from Google Brain / OpenAI has found evidence supporting the claim that distributed synchronous SGD can lead to faster convergence, particularly if it is extended with the notion of "backup workers" as described in the paper.

We will first aim for distributed synchronous SGD with no backup workers, and then extend this to include backup workers.  The MNIST LeNet model will simply serve as an example, and this same approach can be extended to more recent models, such as ResNets.

\[1]: https://arxiv.org/abs/1604.00981
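As a rough illustration of the pattern described above (and not the actual example to be added), a minimal DML-style sketch might look like the following.  All names here ({{X}}, {{y}}, {{W}}, {{lr}}, the worker count {{P}}, the per-worker batch size {{B}}) are hypothetical placeholders, only a single parameter matrix is shown, and the LeNet forward & backward passes are elided:

{code}
# Hypothetical placeholder data and parameters (the real example would use the
# MNIST images/labels and the full set of LeNet parameters).
N = 512    # number of training examples (assumption)
D = 784    # number of features, i.e. 28x28 images (assumption)
K = 10     # number of classes
X = rand(rows=N, cols=D)
y = rand(rows=N, cols=K)
W = rand(rows=D, cols=K)
lr = 0.01  # learning rate (assumption)

P = 4      # number of parallel mini-batches (assumption)
B = 64     # per-worker mini-batch size (assumption)

# One flattened gradient per parallel mini-batch.
dWs = matrix(0, rows=P, cols=D*K)

parfor (p in 1:P) {
  beg = (p - 1) * B + 1
  X_batch = X[beg:(beg+B-1),]  # this worker's mini-batch
  y_batch = y[beg:(beg+B-1),]
  # The forward & backward passes for this mini-batch would go here, producing
  # a gradient dW of the same shape as W; a zero placeholder stands in for it.
  dW = matrix(0, rows=D, cols=K)
  dWs[p,] = matrix(dW, rows=1, cols=D*K)  # store the flattened gradient
}

# Aggregate the gradients by addition and apply a single update, which is
# equivalent to one SGD step with an effective mini-batch size of P * B.
dW_agg = matrix(colSums(dWs), rows=D, cols=K)
W = W - lr * dW_agg
{code}

The intent is that {{parfor}} lets SystemML run the {{P}} mini-batches in parallel (locally or across a cluster), while each iteration writes only its own row of {{dWs}}, so the gradients can be summed and applied in a single update, matching the {{new_mini_batch_size = mini_batch_size * number_of_parallel_mini_batches}} equivalence above.  The backup-workers extension from \[1] would be layered on top of this basic pattern.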

was:
This aims to add a *distributed synchronous SGD* MNIST LeNet example.  In distributed synchronous SGD, multiple mini-batches are run forward & backward simultaneously, and the gradients are aggregated together by addition before the model parameters are updated.  This is mathematically equivalent to simply using a larger mini-batch size, i.e. {{new_mini_batch_size = mini_batch_size * number_of_parallel_mini_batches}}.  The benefit is that distributed synchronous SGD can make use of multiple devices, e.g. multiple GPUs or multiple CPU machines, and thus can speed up training time.  More specifically, using an effectively larger mini-batch size can yield a more stable gradient in expectation, and a larger number of epochs can be run in the same amount of time, both of which lead to faster convergence.

Alternatives include various forms of distributed _asynchronous_ SGD, such as Downpour, Hogwild, etc.  However, a recent paper \[1] from Google Brain / OpenAI has found evidence supporting the claim that distributed synchronous SGD can lead to faster convergence, particularly if it is extended with the notion of "backup workers" as described in the paper.

We will first aim for distributed synchronous SGD with no backup workers, and then extend this to include backup workers.  The MNIST LeNet model will simply serve as an example, and this same approach can be extended to more recent models, such as resnets.

\[1]: https://arxiv.org/abs/1604.00981


> Add a distributed synchronous SGD MNIST LeNet example
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1563
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1563
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>
> This aims to add a *distributed synchronous SGD* MNIST LeNet example.  In distributed synchronous SGD, multiple mini-batches are run forward & backward simultaneously, and the gradients are aggregated together by addition before the model parameters are updated.  This is mathematically equivalent to simply using a larger mini-batch size, i.e. {{new_mini_batch_size = mini_batch_size * number_of_parallel_mini_batches}}.  The benefit is that distributed synchronous SGD can make use of multiple devices, e.g. multiple GPUs or multiple CPU machines, and thus can speed up training time.  More specifically, using an effectively larger mini-batch size can yield a more stable gradient in expectation, and a larger number of epochs can be run in the same amount of time, both of which lead to faster convergence.
> Alternatives include various forms of distributed _asynchronous_ SGD, such as Downpour, Hogwild, etc.  However, a recent paper \[1] from Google Brain / OpenAI has found evidence supporting the claim that distributed synchronous SGD can lead to faster convergence, particularly if it is extended with the notion of "backup workers" as described in the paper.
> We will first aim for distributed synchronous SGD with no backup workers, and then extend this to include backup workers.  The MNIST LeNet model will simply serve as an example, and this same approach can be extended to more recent models, such as ResNets.
> \[1]: https://arxiv.org/abs/1604.00981



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)