[
https://issues.apache.org/jira/browse/SINGA-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
wangwei resolved SINGA-48.
--------------------------
Resolution: Fixed
Assignee: wangwei
> Fix a bug in trainer.cc that assigns the same NeuralNet instance to workers
> from diff groups
> --------------------------------------------------------------------------------------------
>
> Key: SINGA-48
> URL: https://issues.apache.org/jira/browse/SINGA-48
> Project: Singa
> Issue Type: Bug
> Reporter: wangwei
> Assignee: wangwei
>
> In SINGA, workers from the same group and in the same process share the same
> NeuralNet instance, while different worker groups should have different
> NeuralNet objects. However, the current Trainer::SetupWorkerServer function
> assigns the same NeuralNet instance to workers in different groups.
> Consequently, two workers may compute for the same layer instance, which
> leads to repeated calls of the ComputeFeature and ComputeGradient functions
> and causes run-time errors.
> Another issue is that if two workers from different groups reside in the
> same process, they share memory for layer parameters. This memory sharing
> causes no problem if the group size is 1. But if there is more than one
> worker in a group, the workers must run synchronously, and the
> synchronization is controlled by the parameter version. With memory sharing
> enabled, workers from other groups may increase the parameter version,
> which leads to errors in synchronization. To fix this issue, SINGA needs to
> disable memory sharing among groups if the worker group size is > 1.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)