[ https://issues.apache.org/jira/browse/SINGA-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangwei resolved SINGA-48.
--------------------------
    Resolution: Fixed
      Assignee: wangwei

> Fix a bug in trainer.cc that assigns the same NeuralNet instance to workers 
> from diff groups
> --------------------------------------------------------------------------------------------
>
>                 Key: SINGA-48
>                 URL: https://issues.apache.org/jira/browse/SINGA-48
>             Project: Singa
>          Issue Type: Bug
>            Reporter: wangwei
>            Assignee: wangwei
>
> In SINGA, workers from the same group and in the same process share the same 
> NeuralNet instance, while different worker groups should have different 
> NeuralNet objects. However, the current Trainer::SetupWorkerServer function 
> assigns the same NeuralNet instance to workers in different groups. 
> Consequently, two workers may compute over the same layer instance, which 
> leads to repeated calls of the ComputeFeature and ComputeGradient functions 
> and causes run-time errors.
> Another issue is that two workers from different groups but resident in 
> the same process would share memory for layer parameters. This memory 
> sharing causes no problem if the group size is 1. But if there is more than 
> one worker in a group, the workers must run synchronously, and the 
> synchronization is controlled by the parameter version. If memory sharing is 
> enabled, workers from other groups may increase the parameter version, which 
> breaks the synchronization. To fix this issue, SINGA needs to disable memory 
> sharing among groups if the worker group size is > 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)