[ 
https://issues.apache.org/jira/browse/SINGA-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632337#comment-14632337
 ] 

ASF subversion and git services commented on SINGA-32:
------------------------------------------------------

Commit 96bedb2264f7d4ebd8a2a0cad67dc9a91f5419c9 in incubator-singa's branch 
refs/heads/master from wang wei
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=96bedb2 ]

SINGA-32 Implement synchronous training framework

Fix a bug in InitLocalParam() of the Worker class.
A worker owns a Param if the Param's data blob is not shared from other
workers.
Previously, a Worker would not send a Get request for a Param that it owns.
But it might not have initialized that Param locally, because its worker group
is not the first group among the groups that subscribe to the same server
group.

To fix the bug, all workers now send Get requests for the Params in their
local layers.
There is no extra cost for getting Params already owned by the worker (from
the first group), because the Get request is not actually sent when the
Param's version is already the latest.
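
The fixed logic is roughly as below (a minimal C++ sketch under stated
assumptions, not the actual SINGA code; Param, Worker, and the version fields
are simplified stand-ins):

    #include <vector>

    struct Param {
      int local_version = -1;   // version of the locally held data blob (hypothetical field)
      int latest_version = 0;   // latest version published by the server group (hypothetical field)
    };

    struct Worker {
      std::vector<Param*> local_params;  // Params referenced by the worker's local layers

      void SendGetRequest(Param* p) { /* enqueue a GET message via the local stub */ }

      // After the fix, every worker issues a Get for every local Param,
      // whether or not it owns that Param.
      void InitLocalParam() {
        for (Param* p : local_params) {
          if (p->local_version < p->latest_version)
            SendGetRequest(p);  // only a stale local copy triggers a real request
          // Params owned by the worker (first group) already hold the latest
          // version, so no request, and no extra cost, is generated for them.
        }
      }
    };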


> Implement AllReduce training framework
> --------------------------------------
>
>                 Key: SINGA-32
>                 URL: https://issues.apache.org/jira/browse/SINGA-32
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: wangwei
>
> The AllReduce training framework runs in synchronous mode, where one worker 
> starts the next iteration after all workers have finished the previous 
> iteration. Baidu's deepimage system uses this training framework.
> To implement it in SINGA, we launch one worker group and one server group. 
> The model is partitioned (e.g., on dimension 0) among all workers. Params are 
> sliced and partitioned among all servers. 
> At the beginning, each Param (slice) is put into the server shard, together 
> with the number of workers computing gradients for it.
> For each iteration, the local stub aggregates all gradients for the same 
> Param and sends them to the corresponding server, together with the number 
> of local workers computing gradients for it. The server buffers update 
> requests and updates a Param slice only after it has received gradients from 
> all workers. It then sends the updated Param (slices) back to the 
> corresponding processes (stubs).
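
A rough C++ sketch of this server-side buffering (illustrative only; the class
and field names below are hypothetical, not the actual SINGA implementation):

    #include <unordered_map>
    #include <vector>

    struct UpdateRequest {
      int slice_id;              // which Param slice the gradient is for
      std::vector<float> grad;   // aggregated gradient from one process (stub)
      int num_workers;           // local workers that contributed to this gradient
    };

    class Server {
     public:
      explicit Server(int n_total_workers) : n_total_workers_(n_total_workers) {}

      // Buffer update requests; apply the update for a slice only after
      // gradients from all workers have arrived, then reply to every stub.
      void HandleUpdate(const UpdateRequest& req) {
        std::vector<UpdateRequest>& pending = buffer_[req.slice_id];
        pending.push_back(req);
        int received = 0;
        for (const UpdateRequest& r : pending) received += r.num_workers;
        if (received == n_total_workers_) {
          ApplyUpdate(req.slice_id, pending);  // e.g., average grads, run SGD step
          ReplyToStubs(req.slice_id);          // send the updated slice back to the stubs
          buffer_.erase(req.slice_id);
        }
      }

     private:
      void ApplyUpdate(int slice, const std::vector<UpdateRequest>& reqs) {}
      void ReplyToStubs(int slice) {}

      int n_total_workers_;
      std::unordered_map<int, std::vector<UpdateRequest>> buffer_;
    };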



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
