wangwei created SINGA-132:
-----------------------------
Summary: Optimize training on a single node with GPUs
Key: SINGA-132
URL: https://issues.apache.org/jira/browse/SINGA-132
Project: Singa
Issue Type: Improvement
Reporter: wangwei
Assignee: Haibo Chen
There are two training situations.
1. A single worker. In this case there is no need to launch a separate
server thread, since that only adds communication cost between the worker
and the server. Instead, we can create an Updater inside the Worker and call
it to update the parameters locally inside the Worker. The driver's workflow
should be changed accordingly, i.e., there is no need for a stub thread or a
server thread. The worker should run in the main thread, and the program
terminates once the worker finishes.
2. Multiple workers. In this case we need both workers and servers. First, we
can make ZooKeeper an optional dependency, as it is only used for Job ID
generation and termination-condition checks. If no Job ID is available, we can
always use the default Job ID (0). Since there is only one process, we don't
need ZooKeeper to track the status of workers in other processes. Second, the
worker-stub-server communication should be optimized, e.g., by using
GPU-Direct.
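The single-worker dispatch described above could be sketched as follows. This is a
minimal illustration only; the class and method names (Worker, Updater, run_driver)
are assumptions for this sketch, not SINGA's actual API:

```python
class Updater:
    """Applies gradients to parameters locally, with no server round-trip."""
    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, params, grads):
        # Plain SGD step: p <- p - lr * g
        return [p - self.lr * g for p, g in zip(params, grads)]


class Worker:
    """Runs training steps; holding a local Updater avoids the
    worker<->server communication cost in the single-worker case."""
    def __init__(self, updater):
        self.updater = updater
        self.params = [1.0, 2.0]  # toy parameters for illustration

    def train_step(self, grads):
        self.params = self.updater.update(self.params, grads)


def run_driver(num_workers):
    if num_workers == 1:
        # Case 1: no stub or server threads; the worker runs in the
        # main thread and the program ends when it finishes.
        worker = Worker(Updater(lr=0.1))
        worker.train_step([0.5, 0.5])
        return worker.params
    # Case 2: launch workers, a stub, and server threads
    # (the communication path is omitted from this sketch).
    raise NotImplementedError("multi-worker path not sketched here")


print(run_driver(1))  # parameters updated locally: [0.95, 1.95]
```

The key point the sketch captures is the branch in the driver: with one worker,
update logic stays inside the worker and no extra threads are created.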
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)