[ 
https://issues.apache.org/jira/browse/SINGA-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangwei updated SINGA-132:
--------------------------
    Assignee: wangwei  (was: Haibo Chen)

> Optimize training on a single node with GPUs
> --------------------------------------------
>
>                 Key: SINGA-132
>                 URL: https://issues.apache.org/jira/browse/SINGA-132
>             Project: Singa
>          Issue Type: Improvement
>            Reporter: wangwei
>            Assignee: wangwei
>
> There are two training situations.
> 1. A single worker. In this case, there is no need to launch a separate 
> server thread, which would only add communication cost between the worker 
> and the server. Instead, we can create an Updater inside the Worker and 
> call it to update the parameters locally. The Driver's workflow should be 
> changed accordingly, i.e., there is no need for a stub thread or a server 
> thread; the worker runs in the main thread and the program terminates once 
> the worker finishes.
> 2. Multiple workers. In this case, we need both workers and servers. 
> First, we can make ZooKeeper an optional dependency, as it is only used 
> for Job ID generation and termination-condition checks. If no Job ID is 
> available, we can always fall back to the default Job ID (0). Since there 
> is only one process, we don't need ZooKeeper to track the status of 
> workers in other processes. Second, the communication along the 
> worker-stub-server path should be optimized, e.g., using GPU-Direct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
