[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2085:

Attachment: ps.png

> Single-node parameter server primitives
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Attachments: ps.png
>
> A single-node parameter server acts as a data-parallel parameter server. A
> multi-node, model-parallel parameter server will be discussed if time
> permits. The idea is to run the single-node parameter server by maintaining a
> hashmap inside the CP (Control Program), in which each parameter is stored as
> a value under a defined key. For example, inserting the global parameter
> under the key “worker-param-replica” allows the workers to retrieve the
> parameter replica. In the local multi-threaded backend, workers can therefore
> communicate directly with this hashmap in the same process. In the Spark
> distributed backend, the CP first needs to fork a thread that starts a
> parameter server maintaining the hashmap; the workers can then send
> intermediates and retrieve parameters by connecting to the parameter server
> via a TCP socket. Since SystemML has good cache management, the hashmap only
> needs to hold matrix references pointing to file locations instead of the
> actual data. If time permits, introducing the async and staleness update
> strategies would require implementing the synchronization with vector clocks.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
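The keyed hashmap described in the issue above can be pictured with a small Python sketch for the local multi-threaded case. It is only an illustration under stated assumptions, not SystemML code: the class `LocalParamServer`, its `put`/`get` methods, and the file path are hypothetical names; only the key “worker-param-replica” comes from the description. The map stores a reference (a file location) rather than the matrix itself.

```python
import threading


class LocalParamServer:
    """Keyed store shared by worker threads in one process (hypothetical sketch)."""

    def __init__(self):
        self._store = {}               # key -> reference (here: a file path string)
        self._lock = threading.Lock()  # guards concurrent access from workers

    def put(self, key, ref):
        with self._lock:
            self._store[key] = ref

    def get(self, key):
        with self._lock:
            return self._store[key]


def worker(ps, results, idx):
    # Each worker looks up the replica reference under the well-known key.
    results[idx] = ps.get("worker-param-replica")


def run_demo(num_workers=4):
    ps = LocalParamServer()
    # Hypothetical file location standing in for a cached matrix.
    ps.put("worker-param-replica", "/tmp/params/global_model.bin")
    results = [None] * num_workers
    threads = [threading.Thread(target=worker, args=(ps, results, i))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_demo()` returns one reference per worker, all pointing to the same location. The distributed case would replace the direct method calls with requests over a socket, which is not shown here.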
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2085:

Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits. The idea is to run the single-node parameter server by maintaining a hashmap inside the CP (Control Program), in which each parameter is stored as a value under a defined key. For example, inserting the global parameter under the key “worker-param-replica” allows the workers to retrieve the parameter replica. In the local multi-threaded backend, workers can therefore communicate directly with this hashmap in the same process. In the Spark distributed backend, the CP first needs to fork a thread that starts a parameter server maintaining the hashmap; the workers can then send intermediates and retrieve parameters by connecting to the parameter server via a TCP socket. Since SystemML has good cache management, the hashmap only needs to hold matrix references pointing to file locations instead of the actual data. If time permits, introducing the async and staleness update strategies would require implementing the synchronization with vector clocks.

(was: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.)

> Single-node parameter server primitives
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
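The description above only names vector clocks as the basis for the async and staleness strategies. One common way to use them, assumed here and not confirmed by the issue, is to keep one iteration counter per worker and let a worker proceed only while it is at most a fixed number of iterations ahead of the slowest worker. The class `StalenessClock` and its methods are hypothetical names for this sketch.

```python
class StalenessClock:
    """Per-worker iteration counters with a bounded-staleness check (hypothetical sketch)."""

    def __init__(self, num_workers, staleness):
        self.clock = [0] * num_workers  # one counter per worker
        self.staleness = staleness      # maximum allowed lead over the slowest worker

    def can_proceed(self, worker_id):
        # A worker may start its next iteration only if it is not more than
        # `staleness` iterations ahead of the slowest worker.
        return self.clock[worker_id] - min(self.clock) <= self.staleness

    def tick(self, worker_id):
        # Called after a worker has pushed the update of its current iteration.
        self.clock[worker_id] += 1
```

With `staleness=0` every worker must wait for the slowest one after each iteration, which corresponds to a fully synchronous strategy; a very large bound approaches fully asynchronous updates.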
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2085:

Description: A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.

(was: A parameter server persists the model parameters in a distributed manner. It is used primarily for training models in large-scale machine learning. The parameter computation is done with data parallelism across the workers. The data-parallel parameter server architecture is illustrated in Figure 2. Inspired by a lightweight parameter server interface [1], we provide the push and pull methods as internal primitives, i.e., not exposed at the script level, which allow the workers to exchange intermediates.)

> Single-node parameter server primitives
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
[jira] [Updated] (SYSTEMML-2085) Single-node parameter server primitives
[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2085:

Description: A parameter server persists the model parameters in a distributed manner. It is used primarily for training models in large-scale machine learning. The parameter computation is done with data parallelism across the workers. The data-parallel parameter server architecture is illustrated in Figure 2. Inspired by a lightweight parameter server interface [1], we provide the push and pull methods as internal primitives, i.e., not exposed at the script level, which allow the workers to exchange intermediates.

> Single-node parameter server primitives
> ---
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
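The push and pull primitives mentioned above can be sketched as follows. This is a sequential simulation of one synchronous, data-parallel setup, written under assumptions: the class `ParamServer`, the gradient-averaging rule, the learning rate, and the toy least-squares problem are all illustrative choices and are not taken from the issue or from the interface cited as [1].

```python
class ParamServer:
    """Holds the global parameters; workers pull them and push gradients (hypothetical sketch)."""

    def __init__(self, params, num_workers, lr=0.05):
        self.params = list(params)
        self.num_workers = num_workers
        self.lr = lr
        self._pending = []  # gradients received in the current round

    def pull(self):
        # Return a copy so that workers cannot modify the global state directly.
        return list(self.params)

    def push(self, grad):
        self._pending.append(grad)
        if len(self._pending) == self.num_workers:
            # Synchronous update: average all gradients, then take one step.
            avg = [sum(g[i] for g in self._pending) / self.num_workers
                   for i in range(len(self.params))]
            self.params = [p - self.lr * a for p, a in zip(self.params, avg)]
            self._pending = []


def local_gradient(params, shard):
    # Gradient of the mean squared error of y = w * x on one data shard.
    w = params[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]


def train(rounds=20):
    shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]  # y = 3x
    ps = ParamServer([0.0], num_workers=len(shards))
    for _ in range(rounds):
        for shard in shards:  # each loop body stands in for one worker
            ps.push(local_gradient(ps.pull(), shard))
    return ps.params[0]
```

In this toy setting `train()` converges towards the slope 3. A real implementation would run the workers concurrently and exchange matrices rather than Python lists.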
[jira] [Updated] (SYSTEMML-2084) Implementation of language and compiler extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2084:

Description: This part adds language support for the “paramserv” function so that the new function can be compiled. Since SystemML already supports parameterized builtin functions, we can easily add an operation type and generate a new instruction for the “paramserv” function. Recently, we also added a new “eval” built-in function, which accepts a function pointer as an argument so that the function can be called at runtime. Similarly, we would need to extend the inter-procedural analysis so that it does not remove seemingly unused functions in the presence of the second-order “paramserv” function, because the referenced functions, i.e., the aggregate function and the update function, must be present at runtime.

> Implementation of language and compiler extension
> -
>
> Key: SYSTEMML-2084
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2084
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
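The concern about the inter-procedural analysis above can be illustrated with a small reachability sketch. It is not SystemML's actual analysis: the function `reachable_functions`, the dictionary-based call graph, and the function names in the test data are hypothetical. The point is only that functions passed by name to a second-order builtin must be treated as reachable, otherwise dead-function removal would drop them.

```python
def reachable_functions(call_graph, entry, second_order_refs):
    """Return the set of functions that must be kept.

    call_graph:        dict mapping a function to the functions it calls directly.
    second_order_refs: dict mapping a function to the function names it passes
                       as arguments to second-order builtins (e.g. paramserv).
    """
    keep, stack = set(), [entry]
    while stack:
        fn = stack.pop()
        if fn in keep:
            continue
        keep.add(fn)
        stack.extend(call_graph.get(fn, ()))         # ordinary direct calls
        stack.extend(second_order_refs.get(fn, ()))  # functions referenced by name
    return keep


# Hypothetical program: main calls init directly and hands two functions to paramserv.
CALLS = {"main": {"init"}, "init": set(), "gradients": set(),
         "aggregate": set(), "unused": set()}
REFS = {"main": {"gradients", "aggregate"}}
```

Without `REFS`, only `main` and `init` are reachable and the gradient and aggregation functions would be removed; with `REFS`, both are kept while `unused` is still eliminated.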
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2299:

Description: The objective of the “paramserv” built-in function is to update an initial or existing model according to a given configuration. An initial function signature would be _model'=paramserv(model, X, y, X_val, y_val, g_cal_fun, upd=fun1, mode=SYNC, freq=EPOCH, agg=fun2, epochs=100, batchsize=64, k=7, checkpointing=rollback)_. The caller provides the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery).

(was: The objective of the “paramserv” built-in function is to update an initial or existing model according to a given configuration. An initial function signature is illustrated in Figure 1. The caller provides the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery).)

> API design of the paramserv function
>
>
> Key: SYSTEMML-2299
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2299
> Project: SystemML
> Issue Type: Sub-task
> Reporter: LI Guobao
> Assignee: LI Guobao
> Priority: Major
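To make the roles of the arguments in the proposed signature more concrete, here is a rough Python analogue. It is a sketch under several assumptions and not the DML builtin: it keeps only `model`, `X`, `y`, `g_cal_fun`, `upd`, `agg`, `epochs`, `batchsize` and `k`, omits the validation data, `mode`, `freq` and `checkpointing`, simulates the `k` workers sequentially, and applies one synchronous update per batch. The helper functions and the toy data are hypothetical.

```python
def paramserv(model, X, y, g_cal_fun, upd, agg, epochs=100, batchsize=64, k=7):
    """Simplified, sequential analogue of the proposed signature (hypothetical sketch)."""
    n = len(X)
    for _ in range(epochs):
        for start in range(0, n, batchsize):
            Xb, yb = X[start:start + batchsize], y[start:start + batchsize]
            # Split the batch round-robin across k workers; skip empty shards.
            shards = [(Xb[i::k], yb[i::k]) for i in range(k)]
            grads = [g_cal_fun(model, xs, ys) for xs, ys in shards if xs]
            # Aggregate the worker gradients, then apply the batch update.
            model = upd(model, agg(grads))
    return model


def mean_gradient(model, xs, ys):
    # Gradient of the mean squared error of y = w * x on one shard.
    w = model[0]
    return [sum(2 * (w * x - t) * x for x, t in zip(xs, ys)) / len(xs)]


def average(grads):
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(grads[0]))]


def sgd_step(model, grad, lr=0.02):
    return [p - lr * g for p, g in zip(model, grad)]


def demo():
    X, y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # y = 2x
    return paramserv([0.0], X, y, mean_gradient, upd=sgd_step, agg=average,
                     epochs=30, batchsize=2, k=2)
```

In this toy problem `demo()` returns a one-element model whose weight approaches 2. Passing the gradient, update and aggregation steps as functions mirrors the second-order style of the proposed signature.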
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2299:

Description: The objective of the “paramserv” built-in function is to update an initial or existing model according to a given configuration. An initial function signature is illustrated in Figure 1. The caller provides the model, the training features and labels, the validation features and labels, the gradient calculation function, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery).

(was: The objective of the “paramserv” built-in function is to update an initial or existing model according to a given configuration. An initial function signature is illustrated in Figure 1. The caller provides the model, the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery).)

> API design of the paramserv function
>
>
> Key: SYSTEMML-2299
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2299
> Project: SystemML
> Issue Type: Sub-task
> Reporter: LI Guobao
> Assignee: LI Guobao
> Priority: Major
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2299:

Description: The objective of the “paramserv” built-in function is to update an initial or existing model according to a given configuration. An initial function signature is illustrated in Figure 1. The caller provides the model, the training features and labels, the validation features and labels, the batch update function, the update strategy (e.g. sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or batch), the aggregation function, the number of epochs, the batch size, the degree of parallelism, and the checkpointing strategy (e.g. rollback recovery).

> API design of the paramserv function
>
>
> Key: SYSTEMML-2299
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2299
> Project: SystemML
> Issue Type: Sub-task
> Reporter: LI Guobao
> Assignee: LI Guobao
> Priority: Major