wangwei created SINGA-132:
-----------------------------

             Summary: Optimize training on a single node with GPUs
                 Key: SINGA-132
                 URL: https://issues.apache.org/jira/browse/SINGA-132
             Project: Singa
          Issue Type: Improvement
            Reporter: wangwei
            Assignee: Haibo Chen


There are two training situations. 
1. A single worker. In this case, there is no need to launch a separate 
server thread, since doing so would only add communication cost between the 
worker and the server. Instead, we can create an Updater inside the Worker 
and call it to update the parameters locally. The driver's workflow should 
be changed accordingly, i.e., there is no need for a stub thread or a server 
thread. The worker should run in the main thread, and the program terminates 
once the worker finishes.

2. Multiple workers. In this case, we need both workers and servers. First, 
we can make zookeeper an optional dependency, as it is only used for Job ID 
generation and for checking the termination condition. If no Job ID is 
available, we can always use the default Job ID (0). Since there is only one 
process, we don't need zookeeper to track the status of workers in other 
processes. Second, the communication among worker, stub and server should be 
optimized, e.g., using GPU-Direct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
