[ 
https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2085:
--------------------------------
    Issue Type: Technical task  (was: Sub-task)

> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Technical task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>         Attachments: ps.png
>
>
> A single node parameter server acts as a data-parallel parameter server. And 
> a multi-node model parallel parameter server will be discussed if time 
> permits. 
> Push/Pull service: 
> In general, we could launch a parameter server inside (local multi-thread 
> backend) or outside (spark distributed backend) of CP to provide the pull and 
> push service. For the moment, all the weights and biases are saved in a 
> hashmap using a key, e.g., "global parameter". Each worker's gradients will 
> be put into the hashmap seperately with a given key. And the exchange between 
> server and workers will be implemented by TCP. Hence, we could easily 
> broadcast the IP address and the port number to the workers. And then the 
> workers can send the gradients and retrieve the new parameters via TCP 
> socket. The server will also spawn a thread which retrieves the gradients by 
> polling the hashmap using relevant keys and aggregates them. At last, it 
> updates the global parameter in the hashmap.
> Synchronization:
> We also need to implement the synchronization between workers and parameter 
> server to be able to bring more parameter update strategies, e.g., the 
> stale-synchronous strategy needs a hyperparameter "staleness" to define the 
> waiting interval. The idea is to maintain a vector clock recording all 
> workers' clock in the server. Each time when an iteration in side of worker 
> finishes, it waits server to give a signal, i.e., to send a request for 
> calculating the staleness according to the vector clock. And when the server 
> receives the gradients from certain worker, it will increment the vector 
> clock for this worker. So we could define BSP as "staleness==0", ASP as 
> "staleness==-1" and SSP as "staleness==N".
> A diagram of the parameter server architecture is shown below.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to