[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery). The function returns a trained model in struct format. (was: The objective of “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epoch, the batch size, the degree of parallelism as well as the checkpointing strategy (e.g. rollback recovery).)
> API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery). The function returns a trained model in struct format. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
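To make the role of each argument explicit, the proposed signature can be mirrored as a hypothetical Python sketch (illustrative only; the real paramserv is a DML built-in, and the mode/frequency constant names here are assumptions):

```python
# Hypothetical Python mirror of the proposed DML signature, to clarify
# the role of each argument. Not the actual SystemML API.
def paramserv(model, X, y, X_val, y_val, upd, agg,
              mode="SYNC", freq="EPOCH", epochs=100,
              batchsize=64, k=7, checkpointing="rollback"):
    # model: struct-like map of weights, biases and hyperparameters
    # upd:   batch update function run by the workers (gradient computation)
    # agg:   gradient aggregation function run by the parameter server
    # k:     degree of parallelism; checkpointing: e.g. rollback recovery
    assert mode in ("SYNC", "ASYNC", "HOGWILD", "SSP")
    assert freq in ("EPOCH", "BATCH")
    trained = dict(model)  # placeholder: the training loop would go here
    return trained         # trained model returned in struct form
```

The struct-like model in/model out shape is the key point: the caller passes the initial weights, biases and hyperparameters and receives the trained counterparts.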
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Description: The objective of the “paramserv” built-in function is to update an initial or existing model with a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. We are interested in providing the model (a struct-like data structure consisting of the weights, the biases and the hyperparameters), the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery). The function returns a trained model in struct format. (was: the same description, ending with “And the function will return a trained model in format of struct.”)
[jira] [Updated] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Summary: Preparation of dev environment (was: Creation of a test dml script based on NN library) > Preparation of dev environment > -- > > Key: SYSTEMML-2298 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2298 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > During the community bonding period, the development environment should be fully prepared. A test dml script that leverages the new "paramserv" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could also be prepared.
[jira] [Updated] (SYSTEMML-2298) Preparation of dev environment
[ https://issues.apache.org/jira/browse/SYSTEMML-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2298: Description: During the community bonding period, the development environment should be fully prepared. The native library OpenBLAS should be installed in order to run the MNIST LeNet example. Then, by leveraging the MNIST data generator ([http://leon.bottou.org/projects/infimnist]), we could generate 256k instances to train the model. (was: During the bonding time, all the development environment should be well prepared. And a test dml script which leverages the new "paramserv" function to rewrite the training function in the [MNIST LeNet Example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet.dml] could be prepared.)
[jira] [Created] (SYSTEMML-2306) Implementation of a script with paramserv func
LI Guobao created SYSTEMML-2306: --- Summary: Implementation of a script with paramserv func Key: SYSTEMML-2306 URL: https://issues.apache.org/jira/browse/SYSTEMML-2306 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao This task aims to write a dml script that uses the paramserv function. We could easily reuse the MNIST LeNet example and adapt it by creating a struct-like model and passing the update function as well as the aggregation function. In this case, the update function, which will be executed in the workers, should compute the gradients by running the forward and backward passes over a batch. The aggregation function, which will be run in the parameter server, should update the weights and biases by aggregating the received gradients.
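The division of labor between the two functions can be sketched in Python, with a one-parameter linear model standing in for the LeNet layers (all names and the learning rate are illustrative, not the SystemML API):

```python
# Sketch of the two user functions: upd runs in a worker (forward +
# backward pass over one batch), agg runs in the parameter server
# (aggregate gradients, update weights/biases). Illustrative only.
def upd(model, X_batch, y_batch):
    # forward pass: predictions for the batch
    preds = [model["w"] * x + model["b"] for x in X_batch]
    # backward pass: gradients of the squared error w.r.t. w and b
    n = len(X_batch)
    dw = sum(2 * (p - t) * x for p, t, x in zip(preds, y_batch, X_batch)) / n
    db = sum(2 * (p - t) for p, t in zip(preds, y_batch)) / n
    return {"w": dw, "b": db}

def agg(model, grads, lr=0.1):
    # average the workers' gradients, then take one SGD step
    avg = {k: sum(g[k] for g in grads) / len(grads) for k in model}
    return {k: model[k] - lr * avg[k] for k in model}
```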
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Due Date: 17/May/18 (was: 21/May/18)
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2299: Due Date: 16/May/18 (was: 17/May/18)
[jira] [Updated] (SYSTEMML-2306) Implementation of a script with paramserv func
[ https://issues.apache.org/jira/browse/SYSTEMML-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2306: Due Date: 18/May/18
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server; a multi-node model-parallel parameter server will be discussed if time permits. # For the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap inside the CP, where the parameters are stored as values under defined keys. The workers are launched as threads to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # For the Spark distributed backend, we could launch a remote single parameter server outside of the CP (as a worker) to provide the pull and push service. For the moment, all the weights and biases are saved in this single server, and the exchange between server and workers will be implemented over TCP. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We may also need to implement synchronization between the workers and the parameter server to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain in the server a vector clock consisting of all workers' clocks. Each time an iteration finishes, the worker sends a request to the server, and the server responds to indicate whether the worker should wait. A diagram of the parameter server architecture is shown below. (was: the same description without the trailing sentence about the diagram.) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > Attachments: ps.png
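The local multi-threaded case above can be sketched with a lock-protected map standing in for the concurrent hashmap (key names and the update rule are illustrative, not the SystemML implementation):

```python
import threading

# Sketch of the local case: workers push gradients under their own keys
# into a shared map in the CP; an aggregator pulls them and updates the
# global parameter. Illustrative only.
class ParamMap:
    def __init__(self, params):
        self._lock = threading.Lock()
        self._map = {"global": params}

    def push(self, key, value):
        with self._lock:
            self._map[key] = value

    def pull(self, key):
        with self._lock:
            return self._map.get(key)

store = ParamMap({"w": 0.0})

def worker(i):
    # each worker computes a (dummy) gradient and pushes it under its key
    store.push(f"grad-{i}", {"w": 1.0})

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# aggregator thread's job: pull the gradients, update the global parameter
grads = [store.pull(f"grad-{i}") for i in range(4)]
new_w = store.pull("global")["w"] - 0.1 * sum(g["w"] for g in grads) / len(grads)
store.push("global", {"w": new_w})
```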
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single-node parameter server acts as a data-parallel parameter server; a multi-node model-parallel parameter server will be discussed if time permits. # For the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap inside the CP, where the parameters are stored as values under defined keys. The workers are launched as threads to execute the gradient calculation function and push the gradients to the hashmap. Another thread will be launched to pull the gradients from the hashmap and call the aggregation function to update the parameters. # For the Spark distributed backend, we could launch a remote single parameter server outside of the CP (as a worker) to provide the pull and push service. For the moment, all the weights and biases are saved in this single server, and the exchange between server and workers will be implemented over TCP. Hence, we could easily broadcast the IP address and the port number to the workers, and the workers can then send the gradients and retrieve the new parameters via TCP sockets. We may also need to implement synchronization between the workers and the parameter server to support more parameter update strategies; e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain in the server a vector clock consisting of all workers' clocks. Each time an iteration finishes, the worker sends a request to the server, and the server responds to indicate whether the worker should wait. (was: A single node parameter server acts as a data-parallel parameter server. And a multi-node model parallel parameter server will be discussed if time permits. The idea is to run a single-node parameter server by maintaining a hashmap inside the CP (Control Program), where each parameter is stored as a value under a defined key. For example, inserting the global parameter under a key named “worker-param-replica” allows the workers to retrieve the parameter replica. Hence, in the local multi-threaded backend, workers can communicate directly with this hashmap in the same process. In the Spark distributed backend, the CP first needs to fork a thread to start a parameter server which maintains the hashmap, and the workers can then send intermediates and retrieve parameters by connecting to the parameter server via TCP sockets. Since SystemML has good cache management, we only need to keep in the hashmap a matrix reference pointing to a file location instead of the real data instance. If time permits, to be able to introduce the async and staleness update strategies, we would need to implement the synchronization by leveraging a vector clock.)
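The TCP exchange sketched above can be illustrated with a one-shot request/reply: the server binds a port (the address that would be broadcast to the workers), a worker connects, pushes its gradient, and receives the new parameters. The JSON wire format and the aggregation rule are assumptions for illustration only:

```python
import json
import socket
import threading

# Hedged sketch of the push/pull exchange over TCP. One worker, one
# request: send gradients, receive the updated parameters. Illustrative.
params = {"w": 0.0}

def serve(srv):
    conn, _ = srv.accept()
    with conn:
        grad = json.loads(conn.recv(4096).decode())   # push from worker
        params["w"] -= 0.1 * grad["w"]                # aggregate (toy rule)
        conn.sendall(json.dumps(params).encode())     # pull reply

srv = socket.socket()
srv.bind(("127.0.0.1", 0))        # port 0: the OS assigns a free port
srv.listen(1)
host, port = srv.getsockname()    # this is what would be broadcast
t = threading.Thread(target=serve, args=(srv,))
t.start()

# worker side: send the gradients, retrieve the new parameters
with socket.create_connection((host, port)) as c:
    c.sendall(json.dumps({"w": 2.0}).encode())
    new_params = json.loads(c.recv(4096).decode())
t.join()
srv.close()
```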
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on the BSP execution strategy, i.e., synchronous updates either per epoch or per batch. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for the worker threads, launching the single-node parameter server in the CP, shipping and calling the compiled statistical functions, and implementing the different update strategies. We will focus on the BSP execution strategy, i.e., synchronous updates either per epoch or per batch. Other update strategies (e.g. asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. The idea is to spawn a thread that launches the local parameter server, which is responsible for maintaining the parameter hashmap and executing the aggregation work. A number of workers will then be forked according to the level of parallelism. Each worker loads its data partition, performs the parameter update per batch, pushes its gradients, and retrieves the new parameters from the server. The server retrieves the gradients of each worker using the related keys in a round-robin way, aggregates them, and pushes the new global parameters under the parameter-related keys. Finally, the paramserv main thread waits for the server aggregator thread to join and obtains the last global parameters as the final result. Hence, the pull/push primitives bring more flexibility and make it easier to implement other update strategies. (was: the same description without the paragraph starting “The idea is to spawn a thread”.)
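The synchronous flow described above can be sketched sequentially, collapsing the worker threads and the server aggregator thread into one loop for clarity (the toy `upd`/`agg` functions and the learning rate are illustrative):

```python
# Sequential simulation of the BSP flow: each worker computes a gradient
# on its partition, the server aggregates all k gradients, then pushes
# the new global parameters before the next round. Illustrative only.
def upd(model, X, y):
    # worker: gradient of squared error for a one-parameter linear model
    return {"w": sum(2 * (model["w"] * x - t) * x for x, t in zip(X, y)) / len(X)}

def agg(model, grads, lr=0.1):
    # server: average the workers' gradients, take one SGD step
    avg = sum(g["w"] for g in grads) / len(grads)
    return {"w": model["w"] - lr * avg}

def run_bsp(model, partitions, epochs):
    for _ in range(epochs):
        # barrier semantics: all gradients arrive before aggregation
        grads = [upd(model, X, y) for (X, y) in partitions]
        # server pulls the k gradients (round-robin over worker keys in
        # the real backend) and pushes back the new global parameters
        model = agg(model, grads)
    return model
```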
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Description: A single node parameter server acts as a data-parallel parameter server. And a multi-node model parallel parameter server will be discussed if time permits. Push/Pull service: In general, we could launch a parameter server inside (local multi-thread backend) or outside (spark distributed backend) of CP to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap using a key, e.g., "global parameter". Each worker's gradients will be put into the hashmap seperately with a given key. And the exchange between server and workers will be implemented by TCP. Hence, we could easily broadcast the IP address and the port number to the workers. And then the workers can send the gradients and retrieve the new parameters via TCP socket. The server will also spawn a thread which retrieves the gradients by polling the hashmap using relevant keys and aggregates them. At last, it updates the global parameter in the hashmap. Synchronization: We also need to implement the synchronization between workers and parameter server to be able to bring more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock recording all workers' clock in the server. Each time when an iteration in side of worker finishes, it waits server to give a signal, i.e., to send a request for calculating the staleness according to the vector clock. And when the server receives the gradients from certain worker, it will increment the vector clock for this worker. So we could define BSP as "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N". A diagram of the parameter server architecture is shown below. was: A single node parameter server acts as a data-parallel parameter server. 
And a multi-node model parallel parameter server will be discussed if time permits. # For the case of local multi-thread parameter server, it is easy to maintain a concurrent hashmap (where the parameters as value accompanied with a defined key) inside the CP. And the workers are launched in multi-threaded way to execute the gradients calculation function and push the gradients to the hashmap. An another thread will be launched to pull the gradients from hashmap and call the aggregation function to update the parameters. # For the case of spark distributed backend, we could launch a remote single parameter server outside of CP (as a worker) to provide the pull and push service. For the moment, all the weights and biases are saved in this single server. And the exchange between server and workers will be implemented by TCP. Hence, we could easily broadcast the IP address and the port number to the workers. And then the workers can send the gradients and retrieve the new parameters via TCP socket. We could also need to implement the synchronisation between workers and parameter server to be able to bring more parameter update strategies, e.g., the stale-synchronous strategy needs a hyperparameter "staleness" to define the waiting interval. The idea is to maintain a vector clock consisting of all workers' clock in the server. Each time when an iteration finishes, the worker will send a request to server and then the server will send back a response to indicate if the worker should wait or not. A diagram of the parameter server architecture is shown below. > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single node parameter server acts as a data-parallel parameter server. 
A > multi-node model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we could launch a parameter server inside (local multi-thread > backend) or outside (Spark distributed backend) of CP to provide the pull and > push service. For the moment, all the weights and biases are saved in a > hashmap under a key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we could easily > broadcast the IP address and the port number to the workers, and the workers > can then send their gradients and retrieve the new parameters via a TCP > socket. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameter in the hashmap.
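The push/pull service described above can be sketched as a small Java class; this is an illustrative sketch of the design, not the actual SystemML code, and all names (ParamServer, GLOBAL_KEY, the per-worker key scheme) are hypothetical.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the described push/pull service: a concurrent hashmap
// holds the global parameters under a well-known key, and each worker's
// gradients are stored separately under a per-worker key.
public class ParamServer {
    private static final String GLOBAL_KEY = "global parameter";
    private final ConcurrentHashMap<String, double[]> store = new ConcurrentHashMap<>();

    public ParamServer(double[] initialParams) {
        store.put(GLOBAL_KEY, initialParams);
    }

    // Push: a worker deposits its gradients under its own key.
    public void push(int workerId, double[] gradients) {
        store.put("gradients_" + workerId, gradients);
    }

    // Pull: a worker retrieves the current global parameters.
    public double[] pull() {
        return store.get(GLOBAL_KEY);
    }

    // Aggregator thread body: poll the per-worker keys, average the
    // gradients, and write the updated global parameters back to the map.
    public void aggregate(List<Integer> workerIds, double learningRate) {
        double[] params = store.get(GLOBAL_KEY).clone();
        for (int id : workerIds) {
            double[] g = store.remove("gradients_" + id);
            if (g == null) continue; // this worker has not pushed yet
            for (int i = 0; i < params.length; i++)
                params[i] -= learningRate * g[i] / workerIds.size();
        }
        store.put(GLOBAL_KEY, params);
    }
}
```

In the distributed backend the same push/pull/aggregate operations would sit behind a TCP endpoint instead of direct method calls, which is why the description emphasizes broadcasting the server's IP address and port to the workers.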
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: ps.png > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. A > multi-node model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we could launch a parameter server inside (local multi-thread > backend) or outside (Spark distributed backend) of CP to provide the pull and > push service. For the moment, all the weights and biases are saved in a > hashmap under a key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we could easily > broadcast the IP address and the port number to the workers, and the workers > can then send their gradients and retrieve the new parameters via a TCP > socket. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameter in the hashmap. > Synchronization: > We also need to implement the synchronization between workers and the > parameter server in order to support more parameter update strategies, e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain a vector clock in the server > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request asking the server to compute its staleness from the vector clock. 
And when the server > receives gradients from a certain worker, it increments the vector > clock for that worker. So we could define BSP as "staleness==0", ASP as > "staleness==-1" and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
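The vector-clock scheme above admits a compact sketch: the server increments a worker's clock when it receives that worker's gradients, and a worker may only proceed if it is at most "staleness" iterations ahead of the slowest worker. This is an illustrative sketch of the idea, not the actual SystemML implementation; all names are hypothetical.

```java
// Sketch of the staleness check: staleness == 0 gives BSP (everyone in
// lockstep), staleness == -1 means no limit (ASP), and staleness == N
// gives SSP with a bounded lag of N iterations.
public class VectorClock {
    private final int[] clocks;   // one logical clock per worker
    private final int staleness;

    public VectorClock(int numWorkers, int staleness) {
        this.clocks = new int[numWorkers];
        this.staleness = staleness;
    }

    // Called when the server receives gradients from a worker.
    public synchronized void onGradientsReceived(int workerId) {
        clocks[workerId]++;
        notifyAll(); // wake any workers blocked in awaitPermission
    }

    // A worker may proceed if it is at most `staleness` iterations
    // ahead of the slowest worker (or always, for ASP).
    public synchronized boolean mayProceed(int workerId) {
        if (staleness < 0) return true; // ASP: never wait
        int min = Integer.MAX_VALUE;
        for (int c : clocks) min = Math.min(min, c);
        return clocks[workerId] - min <= staleness;
    }

    // A worker blocks here between iterations until it is close
    // enough to the slowest worker.
    public synchronized void awaitPermission(int workerId) throws InterruptedException {
        while (!mayProceed(workerId)) wait();
    }
}
```

Under this sketch, BSP is simply `new VectorClock(k, 0)`, ASP is `new VectorClock(k, -1)`, and SSP is `new VectorClock(k, N)`, matching the staleness encoding in the description.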
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Attachment: (was: ps.png) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. A > multi-node model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we could launch a parameter server inside (local multi-thread > backend) or outside (Spark distributed backend) of CP to provide the pull and > push service. For the moment, all the weights and biases are saved in a > hashmap under a key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we could easily > broadcast the IP address and the port number to the workers, and the workers > can then send their gradients and retrieve the new parameters via a TCP > socket. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameter in the hashmap. > Synchronization: > We also need to implement the synchronization between workers and the > parameter server in order to support more parameter update strategies, e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain a vector clock in the server > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request asking the server to compute its staleness from the vector clock. 
And when the server > receives gradients from a certain worker, it increments the vector > clock for that worker. So we could define BSP as "staleness==0", ASP as > "staleness==-1" and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to design and implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in CP to run the parameter server, and the workers are likewise launched as multiple threads in CP. (was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. It consists of partitioning the data for worker threads, launching the single-node parameter server in CP, shipping and calling the compiled statistical function, and creating different update strategies. We will focus on implementing the BSP execution strategy, i.e., the synchronous update strategy, both per epoch and per batch. Other update strategies (e.g., asynchronous, stale-synchronous) and checkpointing strategies are optional and will be added if time permits. The architecture for the synchronous per-epoch update strategy is illustrated below. The idea is to spawn a thread to launch the local parameter server, which is responsible for maintaining the parameter hashmap and executing the aggregation work. Then a number of workers are forked according to the level of parallelism. Each worker loads its data partition, performs the parameter update per batch, pushes the gradients, and retrieves the new parameters from the server. The server retrieves the gradients of each worker using the related keys in a round-robin way, aggregates them, and publishes the new global parameters under the parameter-related keys. Finally, the paramserv function's main thread waits for the server aggregator thread to join and obtains the last global parameters as the final result. Hence, the pull/push primitives bring more flexibility and facilitate implementing other update strategies.) 
> Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to design and implement a local execution backend for the > compiled “paramserv” function. The idea is to spawn a thread in CP to > run the parameter server, and the workers are likewise launched as multiple > threads in CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
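The local BSP flow in the earlier description (fork k workers, each computes gradients on its partition, the server aggregates and publishes new parameters before the next epoch) could be sketched as below. This is a toy illustration under assumed names (LocalBSP, a dummy gradient, a fixed learning rate of 0.1), not the actual SystemML backend; a barrier action stands in for the server's aggregation step.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Sketch of the local BSP backend: k worker threads each process their
// data partition and deposit a gradient; the barrier action plays the
// role of the parameter server's aggregation, so every worker sees the
// new global parameters before starting the next epoch.
public class LocalBSP {
    public static double[] train(double[][] partitions, int epochs, int k) throws Exception {
        double[] params = new double[]{0.0};   // single global parameter for the toy example
        double[] gradients = new double[k];    // per-worker gradient slots
        // Barrier action == server aggregation: average gradients, update params.
        CyclicBarrier barrier = new CyclicBarrier(k, () -> {
            double sum = 0;
            for (double g : gradients) sum += g;
            params[0] -= 0.1 * sum / k;
        });
        Thread[] workers = new Thread[k];
        for (int w = 0; w < k; w++) {
            final int id = w;
            workers[id] = new Thread(() -> {
                try {
                    for (int e = 0; e < epochs; e++) {
                        // Toy "gradient": the mean of this worker's partition.
                        double g = 0;
                        for (double x : partitions[id]) g += x;
                        gradients[id] = g / partitions[id].length;
                        barrier.await(); // BSP synchronization point per epoch
                    }
                } catch (InterruptedException | BrokenBarrierException ex) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[id].start();
        }
        for (Thread t : workers) t.join();
        return params;
    }
}
```

The round-robin gradient retrieval and per-key publication from the description are collapsed here into the single barrier action; the point of the sketch is the synchronization structure, with the main thread joining the workers and returning the final global parameters.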
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in CP to run the parameter server, and the workers are likewise launched as multiple threads in CP. (was: This part aims to design and implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in CP to run the parameter server, and the workers are likewise launched as multiple threads in CP.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement a local execution backend for the compiled > “paramserv” function. The idea is to spawn a thread in CP to run the > parameter server, and the workers are likewise launched as multiple threads in > CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2307) New structured data types
Matthias Boehm created SYSTEMML-2307: Summary: New structured data types Key: SYSTEMML-2307 URL: https://issues.apache.org/jira/browse/SYSTEMML-2307 Project: SystemML Issue Type: Epic Reporter: Matthias Boehm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2309) Length and right indexing operations over lists
Matthias Boehm created SYSTEMML-2309: Summary: Length and right indexing operations over lists Key: SYSTEMML-2309 URL: https://issues.apache.org/jira/browse/SYSTEMML-2309 Project: SystemML Issue Type: Sub-task Reporter: Matthias Boehm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2310) Length and right indexing operations over structs
Matthias Boehm created SYSTEMML-2310: Summary: Length and right indexing operations over structs Key: SYSTEMML-2310 URL: https://issues.apache.org/jira/browse/SYSTEMML-2310 Project: SystemML Issue Type: Sub-task Reporter: Matthias Boehm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2308) New data types list and struct
Matthias Boehm created SYSTEMML-2308: Summary: New data types list and struct Key: SYSTEMML-2308 URL: https://issues.apache.org/jira/browse/SYSTEMML-2308 Project: SystemML Issue Type: Sub-task Reporter: Matthias Boehm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2308) New data types list and struct, incl constructors
[ https://issues.apache.org/jira/browse/SYSTEMML-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Boehm updated SYSTEMML-2308: - Summary: New data types list and struct, incl constructors (was: New data types list and struct) > New data types list and struct, incl constructors > - > > Key: SYSTEMML-2308 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2308 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the BSP strategy for the Spark distributed backend. The idea is to launch a remote parameter server and the workers. > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the Spark distributed backend. > The idea is to launch a remote parameter server and the workers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2311) Allow lists and structs in function calls
Matthias Boehm created SYSTEMML-2311: Summary: Allow lists and structs in function calls Key: SYSTEMML-2311 URL: https://issues.apache.org/jira/browse/SYSTEMML-2311 Project: SystemML Issue Type: Sub-task Reporter: Matthias Boehm -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2302) Second version of execution backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2302: Description: This part aims to complete the set of update strategies by adding ASP and SSP. > Second version of execution backend > --- > > Key: SYSTEMML-2302 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > This part aims to complete the set of update strategies by adding ASP and SSP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Description: This part aims to implement the BSP strategy for the local execution backend. The idea is to spawn a thread in CP to run the parameter server, and the workers are likewise launched as multiple threads in CP. (was: This part aims to implement a local execution backend for the compiled “paramserv” function. The idea is to spawn a thread in CP to run the parameter server, and the workers are likewise launched as multiple threads in CP.) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the local execution backend. > The idea is to spawn a thread in CP to run the parameter server, and the > workers are likewise launched as multiple threads in CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2084: Due Date: 25/May/18 (was: 28/May/18) > Implementation of language and compiler extension > - > > Key: SYSTEMML-2084 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2084 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to add language support for the “paramserv” > function in order to be able to compile this new function. Since SystemML > already supports parameterized builtin functions, we can easily extend an > additional operation type and generate a new instruction for the “paramserv” > function. Recently, we have also added a new “eval” built-in function which > is capable of passing a function pointer as an argument so that it can be called at > runtime. Similarly, we would need to extend the inter-procedural analysis > to avoid removing seemingly unused functions in the presence of the > second-order “paramserv” function, because the referenced functions, i.e., > the aggregation function and the update function, must be present at runtime. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
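The inter-procedural analysis issue above, that functions referenced only by name (as with “eval” or paramserv's update/aggregation arguments) look unused to a naive dead-function elimination, can be illustrated with a small sketch. The names here (FunctionRegistry, register, eval) are hypothetical and only stand in for the compiler's function-by-name dispatch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Illustration of the second-order problem: the update and aggregation
// functions are resolved from a name at runtime, so a static "who calls
// whom" analysis sees no call site and might prune them. The compiler
// extension must therefore keep such functions alive.
public class FunctionRegistry {
    private final Map<String, DoubleUnaryOperator> functions = new HashMap<>();

    public void register(String name, DoubleUnaryOperator fn) {
        functions.put(name, fn);
    }

    // eval-style call: the target is looked up by its string name, which
    // is invisible to a naive dead-function elimination pass.
    public double eval(String name, double arg) {
        DoubleUnaryOperator fn = functions.get(name);
        if (fn == null)
            throw new IllegalStateException("function was removed: " + name);
        return fn.applyAsDouble(arg);
    }
}
```

If a pruning pass removed a registered function because no direct call to it appears in the program text, the lookup would fail at runtime, which is exactly why the IPA extension described above must treat paramserv's function-valued arguments as live references.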
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2085: Due Date: 1/Jun/18 (was: 4/Jun/18) > Single-node parameter server primitives > --- > > Key: SYSTEMML-2085 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2085 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > Attachments: ps.png > > > A single-node parameter server acts as a data-parallel parameter server. A > multi-node model-parallel parameter server will be discussed if time > permits. > Push/Pull service: > In general, we could launch a parameter server inside (local multi-thread > backend) or outside (Spark distributed backend) of CP to provide the pull and > push service. For the moment, all the weights and biases are saved in a > hashmap under a key, e.g., "global parameter". Each worker's gradients will > be put into the hashmap separately under a given key. The exchange between > server and workers will be implemented over TCP. Hence, we could easily > broadcast the IP address and the port number to the workers, and the workers > can then send their gradients and retrieve the new parameters via a TCP > socket. The server will also spawn a thread which retrieves the gradients by > polling the hashmap with the relevant keys and aggregates them. Finally, it > updates the global parameter in the hashmap. > Synchronization: > We also need to implement the synchronization between workers and the > parameter server in order to support more parameter update strategies, e.g., > the stale-synchronous strategy needs a hyperparameter "staleness" to define > the waiting interval. The idea is to maintain a vector clock in the server > recording all workers' clocks. Each time an iteration inside a worker > finishes, the worker waits for a signal from the server, i.e., it sends a > request asking the server to compute its staleness from the vector clock. 
And when the server > receives gradients from a certain worker, it increments the vector > clock for that worker. So we could define BSP as "staleness==0", ASP as > "staleness==-1" and SSP as "staleness==N". > A diagram of the parameter server architecture is shown below. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2086) Initial version of local backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2086: Due Date: 22/Jun/18 (was: 25/Jun/18) > Initial version of local backend > > > Key: SYSTEMML-2086 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2086 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the local execution backend. > The idea is to spawn a thread in CP to run the parameter server, and the > workers are likewise launched as multiple threads in CP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Due Date: 6/Jul/18 (was: 9/Jul/18) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > This part aims to implement the BSP strategy for the Spark distributed backend. > The idea is to launch a remote parameter server and the workers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2302) Second version of execution backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2302: Due Date: 27/Jul/18 (was: 6/Aug/18) > Second version of execution backend > --- > > Key: SYSTEMML-2302 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2302 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > This part aims to complete the set of update strategies by adding ASP and SSP. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2308) New data type list for lists and structs
[ https://issues.apache.org/jira/browse/SYSTEMML-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias Boehm updated SYSTEMML-2308: - Summary: New data type list for lists and structs (was: New data types list and struct, incl constructors) > New data type list for lists and structs > > > Key: SYSTEMML-2308 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2308 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)