[
https://issues.apache.org/jira/browse/SINGA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
wangwei updated SINGA-132:
--------------------------
Assignee: wangwei (was: Haibo Chen)
> Optimize training on a single node with GPUs
> --------------------------------------------
>
> Key: SINGA-132
> URL: https://issues.apache.org/jira/browse/SINGA-132
> Project: Singa
> Issue Type: Improvement
> Reporter: wangwei
> Assignee: wangwei
>
> There are two training situations.
> 1. A single worker. In this case, there is no need to launch a separate
> server thread, which would only add communication cost between the worker
> and the server. Instead, we can create an Updater inside the Worker and call
> it to update the parameters locally. The Driver's workflow should be changed
> accordingly, i.e., there is no need for a stub thread or a server thread; the
> worker runs in the main thread and the program terminates once the worker
> finishes.
> 2. Multiple workers. In this case, we need both workers and servers. First,
> we can make ZooKeeper an optional dependency, as it is only used for job ID
> generation and termination-condition checks. If no job ID is available, we
> can always use the default job ID (0). Since there is only one process, we
> don't need ZooKeeper to track the status of workers in other processes.
> Second, the worker-stub-server communication should be optimized, e.g., using
> GPU-Direct.
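The single-worker case (case 1) can be sketched as follows. This is a minimal, hypothetical illustration and not SINGA's real API: the `Updater` and `Worker` classes here are stand-ins showing the idea of the worker owning its updater and applying parameter updates in-process, so no server thread or worker-server messaging is needed.

```python
class Updater:
    """Toy SGD updater, applied in-process by the worker
    (illustrative only; not singa's Updater class)."""

    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, params, grads):
        # Plain SGD step: p <- p - lr * g
        return [p - self.lr * g for p, g in zip(params, grads)]


class Worker:
    """Runs in the main thread and owns its Updater, instead of
    sending parameters to a separate server thread."""

    def __init__(self):
        self.updater = Updater()
        self.params = [1.0, 2.0]

    def train(self, steps):
        for _ in range(steps):
            # Pretend the gradient of 0.5*||p||^2 is p itself.
            grads = list(self.params)
            # Local update: no stub/server round trip.
            self.params = self.updater.update(self.params, grads)
        return steps


worker = Worker()
worker.train(10)
print(round(worker.params[0], 4))  # each step scales params by 0.9
```

Once `train` returns, the program can simply exit, matching the proposed flow where termination follows the worker finishing rather than a server shutdown handshake.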
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)