[ 
https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2085:
--------------------------------
    Description: A single-node parameter server acts as a data-parallel 
parameter server; a multi-node, model-parallel parameter server will be 
discussed if time permits. The idea is to run the single-node parameter server 
by maintaining a hashmap inside the CP (Control Program), in which each 
parameter is stored as a value under a defined key. For example, inserting the 
global parameter under the key “worker-param-replica” allows the workers to 
retrieve the parameter replica. With the local multi-threaded backend, workers 
can access this hashmap directly because they run in the same process. With 
the Spark distributed backend, the CP first forks a thread to start a 
parameter server that maintains the hashmap; the workers then send 
intermediates and retrieve parameters by connecting to the parameter server 
over a TCP socket. Since SystemML has good cache management, the hashmap only 
needs to hold a matrix reference pointing to a file location rather than the 
actual data. If time permits, introducing asynchronous and staleness-based 
update strategies would require implementing synchronization with vector 
clocks.  (was: A single node parameter server acts as a data-parallel 
parameter server. And a multi-node model parallel parameter server will be 
discussed if time permits. )
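
For the local multi-threaded case, the keyed hashmap could look roughly like 
the sketch below. It is only an illustration under stated assumptions: the 
names LocalParamServer, push and pull, and the file path, are hypothetical and 
are not SystemML APIs. The map stores a reference (here a file path string) 
rather than the matrix data, as the description suggests.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: names are illustrative and not part of SystemML.
class LocalParamServer {
    // Maps a parameter key to a reference (a file path), not to the matrix data itself.
    private final Map<String, String> refs = new ConcurrentHashMap<>();

    // Publish or overwrite the reference stored under the given key.
    void push(String key, String matrixFilePath) {
        refs.put(key, matrixFilePath);
    }

    // Look up the reference for a key; returns null if nothing was published.
    String pull(String key) {
        return refs.get(key);
    }
}

class LocalParamServerDemo {
    public static void main(String[] args) throws InterruptedException {
        LocalParamServer ps = new LocalParamServer();
        // Assumed file location, for illustration only.
        ps.push("worker-param-replica", "/tmp/global_params.bin");

        // Workers are plain threads in the same process and read the map directly.
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(() ->
                System.out.println("worker " + id + " -> " + ps.pull("worker-param-replica")));
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

A concurrent map is used here so that worker threads can read and write 
without external locking; whether the real implementation needs coarser 
synchronization depends on the update strategy.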

> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> A single-node parameter server acts as a data-parallel parameter server; a 
> multi-node, model-parallel parameter server will be discussed if time 
> permits. The idea is to run the single-node parameter server by maintaining 
> a hashmap inside the CP (Control Program), in which each parameter is stored 
> as a value under a defined key. For example, inserting the global parameter 
> under the key “worker-param-replica” allows the workers to retrieve the 
> parameter replica. With the local multi-threaded backend, workers can access 
> this hashmap directly because they run in the same process. With the Spark 
> distributed backend, the CP first forks a thread to start a parameter server 
> that maintains the hashmap; the workers then send intermediates and retrieve 
> parameters by connecting to the parameter server over a TCP socket. Since 
> SystemML has good cache management, the hashmap only needs to hold a matrix 
> reference pointing to a file location rather than the actual data. If time 
> permits, introducing asynchronous and staleness-based update strategies 
> would require implementing synchronization with vector clocks.
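
One possible reading of the vector-clock idea is sketched below: the server 
keeps one logical clock per worker and lets a worker proceed only while it is 
at most a fixed number of steps ahead of the slowest worker. This is a 
bounded-staleness rule in the style of stale-synchronous-parallel training, 
chosen here as an assumption; the issue does not specify the exact scheme, and 
the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: one logical clock per worker plus a bounded-staleness check.
class VectorClock {
    private final int[] clocks;   // clocks[i] = number of updates worker i has pushed
    private final int staleness;  // maximum allowed lead over the slowest worker

    VectorClock(int numWorkers, int staleness) {
        if (numWorkers < 1 || staleness < 0) {
            throw new IllegalArgumentException("need at least one worker and staleness >= 0");
        }
        this.clocks = new int[numWorkers];
        this.staleness = staleness;
    }

    // Record that the given worker has pushed one more update.
    synchronized void tick(int worker) {
        clocks[worker]++;
    }

    // A worker may proceed only if it is at most `staleness` steps ahead of the slowest worker.
    synchronized boolean mayProceed(int worker) {
        int min = Arrays.stream(clocks).min().getAsInt();
        return clocks[worker] - min <= staleness;
    }
}
```

Under this rule, a staleness of 0 forces all workers to stay in lockstep, 
while a very large bound behaves like a fully asynchronous update.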



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
