[jira] [Resolved] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing

2018-09-11 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2446.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Paramserv adagrad ASP batch disjoint_continuous failing
> ---
>
> Key: SYSTEMML-2446
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2446
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> {code}
> Caused by: java.io.IOException: File 
> scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on 
> HDFS/LFS.
> at 
> org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120)
> at 
> org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SYSTEMML-2304) Submit final product

2018-08-14 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2304.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Submit final product
> 
>
> Key: SYSTEMML-2304
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2304
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Closed] (SYSTEMML-2302) Second version of execution backend

2018-08-10 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao closed SYSTEMML-2302.
---
   Resolution: Invalid
Fix Version/s: SystemML 1.2

> Second version of execution backend
> ---
>
> Key: SYSTEMML-2302
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2302
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> This part aims to complement the updating strategies by adding ASP and SSP.





[jira] [Resolved] (SYSTEMML-2458) Add experiment on spark paramserv

2018-08-09 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2458.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Add experiment on spark paramserv
> -
>
> Key: SYSTEMML-2458
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Resolved] (SYSTEMML-2090) Documentation of language extension

2018-08-06 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2090.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Documentation of language extension
> ---
>
> Key: SYSTEMML-2090
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2090
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function

2018-08-05 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2299:

Description: 
The objective of the “paramserv” built-in function is to update an initial or 
existing model with the given configuration. An initial function signature would be: 
{code:java}
model'=paramserv(model=paramsList, features=X, labels=Y, val_features=X_val, 
val_labels=Y_val, upd="fun1", agg="fun2", mode="LOCAL", utype="BSP", 
freq="BATCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous", 
hyperparams=params, checkpointing="NONE"){code}
We are interested in providing the model (a struct-like data structure 
consisting of the weights, the biases, and the hyperparameters), the 
training features and labels, the validation features and labels, the batch 
update function (i.e., the gradient calculation function), the update strategy 
(e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch or 
mini-batch), the gradient aggregation function, the number of epochs, the batch 
size, the degree of parallelism, the data partition scheme, a list of 
additional hyperparameters, as well as the checkpointing strategy. The 
function returns the trained model in the same struct format.

*Inputs*:
 * model : a list consisting of the weight and bias matrices
 * features : the training feature matrix
 * labels : the training label matrix
 * val_features [optional]: the validation feature matrix
 * val_labels [optional]: the validation label matrix
 * upd : the name of the gradient calculation function
 * agg : the name of the gradient aggregation function
 * mode (options: LOCAL, REMOTE_SPARK): the execution backend where 
the parameter server runs
 * utype (options: BSP, ASP, SSP): the update strategy
 * freq [optional] (default: BATCH) (options: EPOCH, BATCH): the 
frequency of updates
 * epochs : the number of epochs
 * batchsize [optional] (default: 64): the batch size; if the 
update frequency is "EPOCH", this argument is ignored
 * k [optional] (default: the number of vcores, or vcores / 2 if 
using OpenBLAS): the degree of parallelism
 * scheme [optional] (default: disjoint_contiguous) (options: 
disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): 
the data partition scheme, i.e., how the data is distributed across workers
 * hyperparams [optional]: a list of additional hyperparameters, 
e.g., learning rate, momentum
 * checkpointing [optional] (default: NONE) (options: NONE, EPOCH, 
EPOCH10): the checkpoint strategy; a checkpoint can be taken after each epoch or 
every 10 epochs 

*Output*:
 * model' : a list consisting of the updated weight and bias matrices
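The four data partition schemes can be illustrated with a small sketch (an illustrative Python reimplementation over row indices, not SystemML's actual code; the `partition` helper and its seed handling are hypothetical):

```python
import random

def partition(n_rows, k, scheme, seed=42):
    """Distribute row indices 0..n_rows-1 across k workers (illustrative)."""
    idx = list(range(n_rows))
    rnd = random.Random(seed)
    size = (n_rows + k - 1) // k  # ceil(n_rows / k)
    if scheme == "disjoint_contiguous":
        # consecutive row blocks of roughly equal size
        return [idx[w * size:(w + 1) * size] for w in range(k)]
    if scheme == "disjoint_round_robin":
        # row i goes to worker i mod k
        return [idx[w::k] for w in range(k)]
    if scheme == "disjoint_random":
        # shuffle once, then split into contiguous blocks
        rnd.shuffle(idx)
        return [idx[w * size:(w + 1) * size] for w in range(k)]
    if scheme == "overlap_reshuffle":
        # every worker sees all rows, each in its own random order
        parts = []
        for _ in range(k):
            p = idx[:]
            rnd.shuffle(p)
            parts.append(p)
        return parts
    raise ValueError("unknown scheme: " + scheme)
```

The three disjoint schemes cover each row exactly once across workers, while overlap_reshuffle replicates the full data to every worker in a different order.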

  was:
The objective of “paramserv” built-in function is to update an initial or 
existing model with configuration. An initial function signature would be: 
{code:java}
model'=paramserv(model=paramsList, features=X, labels=Y, val_features=X_val, 
val_labels=Y_val, upd="fun1", agg="fun2", mode="LOCAL", utype="BSP", 
freq="BATCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous", 
hyperparams=params, checkpointing="NONE"){code}
We are interested in providing the model (which will be a struct-like data 
structure consisting of the weights, the biases and the hyperparameters), the 
training features and labels, the validation features and labels, the batch 
update function (i.e., gradient calculation func), the update strategy (e.g. 
sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch or 
mini-batch), the gradient aggregation function, the number of epoch, the batch 
size, the degree of parallelism, the data partition scheme, a list of 
additional hyper parameters, as well as the checkpointing strategy. And the 
function will return a trained model in struct format.

*Inputs*:
 * model : a list consisting of the weight and bias matrices
 * features : training features matrix
 * labels : training label matrix
 * val_features : validation features matrix
 * val_labels : validation label matrix
 * upd : the name of gradient calculation function
 * agg : the name of gradient aggregation function
 * mode  (options: LOCAL, REMOTE_SPARK): the execution backend where 
the parameter is executed
 * utype  (options: BSP, ASP, SSP): the updating mode
 * freq  [optional] (default: BATCH) (options: EPOCH, BATCH) : the 
frequence of updates
 * epochs : the number of epoch
 * batchsize  [optional] (default: 64): the size of batch, if the 
update frequence is "EPOCH", this argument will be ignored
 * k  [optional] (default: number of vcores, otherwise vcores / 2 if 
using openblas): the degree of parallelism
 * scheme  [optional] (default: disjoint_contiguous) (options: 
disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): 
the scheme of data partition, i.e., how the data is distributed across workers
 * hyperparams  [optional]: a list consisting 

[jira] [Commented] (SYSTEMML-2458) Add experiment on spark paramserv

2018-08-05 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569424#comment-16569424
 ] 

LI Guobao commented on SYSTEMML-2458:
-

[~mboehm7], yes, I added the baseline experiment w/o paramserv and fixed the 
location of the SystemML-config.xml file. Additionally, I've double-checked the 
native BLAS configuration for the remote workers, and it is correctly transferred 
and set on the remote workers.

> Add experiment on spark paramserv
> -
>
> Key: SYSTEMML-2458
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>






[jira] [Commented] (SYSTEMML-2458) Add experiment on spark paramserv

2018-08-04 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569312#comment-16569312
 ] 

LI Guobao commented on SYSTEMML-2458:
-

[~mboehm7], since I was hoping to have some experiment results for the 
presentation, I have pushed the latest polished scripts and a newly packaged 
jar with the recent patches. Maybe we could continue launching the experiments?

> Add experiment on spark paramserv
> -
>
> Key: SYSTEMML-2458
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>






[jira] [Resolved] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-03 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2482.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> Some unexpected overhead occurred when running 
> {{*testParamservASPEpochDisjointContiguous*}} in test 
> {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
>  The test took more time to finish when the output of an instruction is a 
> list, which is cleaned up after execution. However, the matrices referenced 
> by the list should be pinned to avoid being cleaned up. This issue is 
> related to 
> [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481] 





[jira] [Commented] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567477#comment-16567477
 ] 

LI Guobao commented on SYSTEMML-2482:
-

I just saw your latest commit, thanks for the help. And yes, let's keep the 
current behavior.

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> Some unexpected overhead occurred when running the 
> {{*testParamservASPEpochDisjointContiguous*}} in test 
> {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
>  It took more time to finish the test in the case that the output of 
> instruction is a list which will be cleaned up after the execution. However, 
> the matrices referenced by the list should be pinned to avoid being cleaned 
> up. And this issue is related to 
> [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481] 





[jira] [Commented] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567470#comment-16567470
 ] 

LI Guobao commented on SYSTEMML-2482:
-

[~mboehm7] well, sorry about the vague description. Actually, I just found 
that the data status in the list object is no longer used (i.e., it is null or an 
array of false). Before that commit, all matrices of a list output would be 
pinned in the variables table, with the pinned status saved in this boolean 
array. I fixed the eviction problem by changing the logic for cleaning up the 
list object according to its data status.
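A minimal sketch of the fix's idea (hypothetical names, not the actual CacheableData API): the list object carries a boolean status per element recording whether that element is pinned in the variables table, and the cleanup pass skips pinned entries:

```python
class Block:
    """Toy stand-in for a cached matrix object (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.cleaned = False

def cleanup_list(blocks, status):
    """Clean up a list output, skipping entries whose status flag marks
    them as pinned in the variables table."""
    cleaned = []
    for block, pinned in zip(blocks, status):
        if not pinned:  # only unpinned entries may be destroyed
            block.cleaned = True
            cleaned.append(block.name)
    return cleaned
```

With this guard, matrices still referenced from the variables table survive the list cleanup instead of being evicted and re-read from HDFS.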

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> Some unexpected overhead occurred when running the 
> {{*testParamservASPEpochDisjointContiguous*}} in test 
> {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
>  It took more time to finish the test in the case that the output of 
> instruction is a list which will be cleaned up after the execution. However, 
> the matrices referenced by the list should be pinned to avoid being cleaned 
> up. And this issue is related to 
> [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481] 





[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2482:

Description: Some unexpected overhead occurred when running 
{{*testParamservASPEpochDisjointContiguous*}} in test 
{{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
 The test took more time to finish when the output of an instruction is a list, 
which is cleaned up after execution. However, the matrices referenced by the 
list should be pinned to avoid being cleaned up. 
This issue is related to 
[SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481]   (was: 
Some unexpected overhead occurred when running the 
{{*testParamservASPEpochDisjointContiguous*}} in test 
{{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
 It took more time to finish the test in the case that the output of 
instruction is a list which will be cleaned up after the execution. And this 
issue is related to 
[SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481] )

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> Some unexpected overhead occurred when running the 
> {{*testParamservASPEpochDisjointContiguous*}} in test 
> {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
>  It took more time to finish the test in the case that the output of 
> instruction is a list which will be cleaned up after the execution. However, 
> the matrices referenced by the list should be pinned to avoid being cleaned 
> up. And this issue is related to 
> [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481] 





[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2482:

Description: Some unexpected overhead occurred when running the 
{{*testParamservASPEpochDisjointContiguous*}} in test 
{{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
 It took more time to finish the test in the case that the output of 
instruction is a list which will be cleaned up after the execution. And this 
issue is related to 
[SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481]   (was: 
Some unexpected overhead occurred when running the 
{{*testParamservASPEpochDisjointContiguous*}} in test 
{{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
 It took more time to finish the test in the case that the output of 
instruction is a list which will be cleaned up after the execution. And this 
issue is related to 
[ticket|https://issues.apache.org/jira/browse/SYSTEMML-2481] )

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> Some unexpected overhead occurred when running the 
> {{*testParamservASPEpochDisjointContiguous*}} in test 
> {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
>  It took more time to finish the test in the case that the output of 
> instruction is a list which will be cleaned up after the execution. And this 
> issue is related to 
> [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481] 





[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2482:

Description: Some unexpected overhead occurred when running the 
{{*testParamservASPEpochDisjointContiguous*}} in test 
{{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
 It took more time to finish the test in the case that the output of 
instruction is a list which will be cleaned up after the execution. And this 
issue is related to 
[ticket|https://issues.apache.org/jira/browse/SYSTEMML-2481]   (was: Some 
unexpected overhead occurred when running the 
{{*testParamservASPEpochDisjointContiguous*}} in test 
{{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
 It took more time to finish the test in the case that the output of 
instruction is a list which will be cleaned up after the execution.)

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> Some unexpected overhead occurred when running the 
> {{*testParamservASPEpochDisjointContiguous*}} in test 
> {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
>  It took more time to finish the test in the case that the output of 
> instruction is a list which will be cleaned up after the execution. And this 
> issue is related to 
> [ticket|https://issues.apache.org/jira/browse/SYSTEMML-2481] 





[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2482:

Description: Some unexpected overhead occurred when running the 
{{*testParamservASPEpochDisjointContiguous*}} in test 
{{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
 It took more time to finish the test in the case that the output of 
instruction is a list which will be cleaned up after the execution.  (was: Some 
unexpected overhead occurred when running the 
{{testParamservASPEpochDisjointContiguous}} in test 
{{org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest}}. 
It took more time to finish the test in the case that the output of instruction 
is a list which will be cleaned up after the execution.)

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> Some unexpected overhead occurred when running the 
> {{*testParamservASPEpochDisjointContiguous*}} in test 
> {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}.
>  It took more time to finish the test in the case that the output of 
> instruction is a list which will be cleaned up after the execution.





[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2482:

Description: Some unexpected overhead occurred when running the 
{{testParamservASPEpochDisjointContiguous}} in test 
{{org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest}}. 
It took more time to finish the test in the case that the output of instruction 
is a list which will be cleaned up after the execution.

> Unexpected cleanup of list object
> -
>
> Key: SYSTEMML-2482
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> Some unexpected overhead occurred when running the 
> {{testParamservASPEpochDisjointContiguous}} in test 
> {{org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest}}.
>  It took more time to finish the test in the case that the output of 
> instruction is a list which will be cleaned up after the execution.





[jira] [Created] (SYSTEMML-2482) Unexpected cleanup of list object

2018-08-02 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2482:
---

 Summary: Unexpected cleanup of list object
 Key: SYSTEMML-2482
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
 Project: SystemML
  Issue Type: Bug
Reporter: LI Guobao








[jira] [Updated] (SYSTEMML-2478) Overhead when using parfor in update func

2018-08-01 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2478:

Description: 
When using parfor inside update function, some MR tasks are launched to write 
the output of task. And it took more time to finish the paramserv run than 
without parfor in update function. The scenario is to launch the ASP Epoch DC 
spark paramserv test.
Here is the stack:
{code:java}
Total elapsed time: 101.804 sec.
Total compilation time: 3.690 sec.
Total execution time:   98.114 sec.
Number of compiled Spark inst:  302.
Number of executed Spark inst:  540.
Cache hits (Mem, WB, FS, HDFS): 57839/0/0/240.
Cache writes (WB, FS, HDFS):14567/58/61.
Cache times (ACQr/m, RLS, EXP): 42.346/0.064/4.761/20.280 sec.
HOP DAGs recompiled (PRED, SB): 0/144.
HOP DAGs recompile time:0.507 sec.
Functions recompiled:   16.
Functions recompile time:   0.064 sec.
Spark ctx create time (lazy):   1.376 sec.
Spark trans counts (par,bc,col):270/1/240.
Spark trans times (par,bc,col): 0.573/0.197/42.255 secs.
Paramserv total num workers:3.
Paramserv setup time:   1.559 secs.
Paramserv grad compute time:105.701 secs.
Paramserv model update time:56.801/47.193 secs.
Paramserv model broadcast time: 23.872 secs.
Paramserv batch slice time: 0.000 secs.
Paramserv RPC request time: 105.159 secs.
ParFor loops optimized: 1.
ParFor optimize time:   0.040 sec.
ParFor initialize time: 0.434 sec.
ParFor result merge time:   0.005 sec.
ParFor total update in-place:   0/7/7
Total JIT compile time: 68.384 sec.
Total JVM GC count: 1120.
Total JVM GC time:  22.338 sec.
Heavy hitter instructions:
  #  Instruction Time(s)  Count
  1  paramserv97.221  1
  2  conv2d_bias_add  60.581614
  3  *54.990  12447
  4  sp_- 20.625240
  5  -17.979   7287
  6  +14.191  12824
  7  r'5.636   1200
  8  conv2d_backward_filter5.123600
  9  max   4.985907
 10  ba+*  4.591   1814

{code}

Here is the polished update func:

{code:java}
aggregation = function(list[unknown] model,
                       list[unknown] gradients,
                       list[unknown] hyperparams)
    return (list[unknown] modelResult)
{
  lr = as.double(as.scalar(hyperparams["lr"]))
  mu = as.double(as.scalar(hyperparams["mu"]))

  modelResult = model

  # Optimize with SGD w/ Nesterov momentum
  parfor(i in 1:8, check=0) {
    P = as.matrix(model[i])
    dP = as.matrix(gradients[i])
    vP = as.matrix(model[8+i])
    [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
    modelResult[i] = P
    modelResult[8+i] = vP
  }
}
{code}

[~mboehm7], in fact, I have no idea where this overhead comes from. It seems 
that the parfor task outputs are written to HDFS. Is this the expected 
behavior?
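For reference, here is a scalar Python sketch of the SGD-with-Nesterov-momentum step that sgd_nesterov::update is assumed to perform (the DML library applies the same formula elementwise to whole matrices):

```python
def sgd_nesterov_update(x, dx, lr, mu, v):
    """One Nesterov-momentum SGD step on a single parameter (sketch).
    x: parameter, dx: gradient, lr: learning rate, mu: momentum, v: velocity."""
    v_prev = v
    v = mu * v - lr * dx                 # update the velocity
    x = x - mu * v_prev + (1 + mu) * v   # parameter step with look-ahead
    return x, v
```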

  was:
When using parfor inside update function, some MR tasks are launched to write 
the output of task. And it took more time to finish the paramserv run than 
without parfor in update function. The scenario is to launch the ASP Epoch DC 
spark paramserv test.
Here is the stack:
{code:java}
Total elapsed time: 101.804 sec.
Total compilation time: 3.690 sec.
Total execution time:   98.114 sec.
Number of compiled Spark inst:  302.
Number of executed Spark inst:  540.
Cache hits (Mem, WB, FS, HDFS): 57839/0/0/*240*.
Cache writes (WB, FS, HDFS):14567/58/61.
Cache times (ACQr/m, RLS, EXP): 42.346/0.064/4.761/20.280 sec.
HOP DAGs recompiled (PRED, SB): 0/144.
HOP DAGs recompile time:0.507 sec.
Functions recompiled:   16.
Functions recompile time:   0.064 sec.
Spark ctx create time (lazy):   1.376 sec.
Spark trans counts (par,bc,col):270/1/240.
Spark trans times (par,bc,col): 0.573/0.197/42.255 secs.
Paramserv total num workers:3.
Paramserv setup time:   1.559 secs.
Paramserv grad compute time:105.701 secs.
Paramserv model update time:56.801/47.193 secs.
Paramserv model broadcast time: 23.872 secs.
Paramserv batch slice time: 0.000 secs.
Paramserv RPC request time: 105.159 secs.
ParFor loops optimized: 1.
ParFor optimize time:   0.040 sec.
ParFor initialize time: 0.434 sec.
ParFor result merge time:   0.005 sec.
ParFor total update in-place:   0/7/7
Total JIT compile time: 68.384 sec.
Total JVM GC count: 1120.
Total JVM GC time:  22.338 sec.
Heavy hitter instructions:
  #  Instruction Time(s)  Count
  1  paramserv97.221  1
  2  conv2d_bias_add  60.581614
  3  *54.990  12447
  4  sp_- 20.625

[jira] [Updated] (SYSTEMML-2478) Overhead when using parfor in update func

2018-08-01 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2478:

Description: 
When using parfor inside update function, some MR tasks are launched to write 
the output of task. And it took more time to finish the paramserv run than 
without parfor in update function. The scenario is to launch the ASP Epoch DC 
spark paramserv test.
Here is the stack:
{code:java}
Total elapsed time: 101.804 sec.
Total compilation time: 3.690 sec.
Total execution time:   98.114 sec.
Number of compiled Spark inst:  302.
Number of executed Spark inst:  540.
Cache hits (Mem, WB, FS, HDFS): 57839/0/0/*240*.
Cache writes (WB, FS, HDFS):14567/58/61.
Cache times (ACQr/m, RLS, EXP): 42.346/0.064/4.761/20.280 sec.
HOP DAGs recompiled (PRED, SB): 0/144.
HOP DAGs recompile time:0.507 sec.
Functions recompiled:   16.
Functions recompile time:   0.064 sec.
Spark ctx create time (lazy):   1.376 sec.
Spark trans counts (par,bc,col):270/1/240.
Spark trans times (par,bc,col): 0.573/0.197/42.255 secs.
Paramserv total num workers:3.
Paramserv setup time:   1.559 secs.
Paramserv grad compute time:105.701 secs.
Paramserv model update time:56.801/47.193 secs.
Paramserv model broadcast time: 23.872 secs.
Paramserv batch slice time: 0.000 secs.
Paramserv RPC request time: 105.159 secs.
ParFor loops optimized: 1.
ParFor optimize time:   0.040 sec.
ParFor initialize time: 0.434 sec.
ParFor result merge time:   0.005 sec.
ParFor total update in-place:   0/7/7
Total JIT compile time: 68.384 sec.
Total JVM GC count: 1120.
Total JVM GC time:  22.338 sec.
Heavy hitter instructions:
  #  Instruction Time(s)  Count
  1  paramserv97.221  1
  2  conv2d_bias_add  60.581614
  3  *54.990  12447
  4  sp_- 20.625240
  5  -17.979   7287
  6  +14.191  12824
  7  r'5.636   1200
  8  conv2d_backward_filter5.123600
  9  max   4.985907
 10  ba+*  4.591   1814

{code}

Here is the polished update func:

{code:java}
aggregation = function(list[unknown] model,
   list[unknown] gradients,
   list[unknown] hyperparams)
   return (list[unknown] modelResult) {
 lr = as.double(as.scalar(hyperparams["lr"]))
 mu = as.double(as.scalar(hyperparams["mu"]))

 modelResult = model

 # Optimize with SGD w/ Nesterov momentum
 parfor(i in 1:8, check=0) {
   P = as.matrix(model[i])
   dP = as.matrix(gradients[i])
   vP = as.matrix(model[8+i])
   [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
   modelResult[i] = P
   modelResult[8+i] = vP
 }
   }
{code}

[~mboehm7], in fact, I have no idea where the cause comes from? It seems that 
it tried to write the parfor task output into HDFS. So is it the normal 
behavior?

  was:When using parfor inside update function, some MR tasks 


> Overhead when using parfor in update func
> -
>
> Key: SYSTEMML-2478
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> When using parfor inside update function, some MR tasks are launched to write 
> the output of task. And it took more time to finish the paramserv run than 
> without parfor in update function. The scenario is to launch the ASP Epoch DC 
> spark paramserv test.
> Here is the stack:
> {code:java}
> Total elapsed time:   101.804 sec.
> Total compilation time:   3.690 sec.
> Total execution time: 98.114 sec.
> Number of compiled Spark inst:302.
> Number of executed Spark inst:540.
> Cache hits (Mem, WB, FS, HDFS):   57839/0/0/*240*.
> Cache writes (WB, FS, HDFS):  14567/58/61.
> Cache times (ACQr/m, RLS, EXP):   42.346/0.064/4.761/20.280 sec.
> HOP DAGs recompiled (PRED, SB):   0/144.
> HOP DAGs recompile time:  0.507 sec.
> Functions recompiled: 16.
> Functions recompile time: 0.064 sec.
> Spark ctx create time (lazy): 1.376 sec.
> Spark trans counts (par,bc,col):270/1/240.
> Spark trans times (par,bc,col):   0.573/0.197/42.255 secs.
> Paramserv total num workers:  3.
> Paramserv setup time: 1.559 secs.
> Paramserv grad compute time:  105.701 secs.
> Paramserv model update time:  56.801/47.193 secs.
> Paramserv model broadcast time:   23.872 secs.
> Paramserv batch slice time:   0.000 secs.
> Paramserv RPC request time:   105.159 secs.
> ParFor loops optimized:   1.
> ParFor optimize time: 0.040 sec.
> ParFor initializ

[jira] [Updated] (SYSTEMML-2478) Overhead when using parfor in update func

2018-08-01 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2478:

Summary: Overhead when using parfor in update func  (was: Unexpected MR 
task when using parfor)

> Overhead when using parfor in update func
> -
>
> Key: SYSTEMML-2478
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> When using parfor inside update function, some MR tasks 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2478) Unexpected MR task when using parfor

2018-08-01 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2478:

Description: When using parfor inside update function, some MR tasks 

> Unexpected MR task when using parfor
> 
>
> Key: SYSTEMML-2478
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> When using parfor inside update function, some MR tasks 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SYSTEMML-2478) Unexpected MR task when using parfor

2018-08-01 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2478:
---

 Summary: Unexpected MR task when using parfor
 Key: SYSTEMML-2478
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
 Project: SystemML
  Issue Type: Bug
Reporter: LI Guobao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SYSTEMML-2477) NPE when copying list object

2018-08-01 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2477.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> NPE when copying list object
> 
>
> Key: SYSTEMML-2477
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2477
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SYSTEMML-2477) NPE when copying list object

2018-08-01 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2477:
---

 Summary: NPE when copying list object
 Key: SYSTEMML-2477
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2477
 Project: SystemML
  Issue Type: Bug
Reporter: LI Guobao
Assignee: LI Guobao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2476) Unexpected mapreduce task

2018-07-31 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2476:

Description: 
When trying to use scalar casting to get an element from a list, unexpected 
MapReduce tasks are launched instead of running in CP mode. The scenario is to 
replace *C = 1* with *C = as.scalar(hyperparams["C"])* inside the {{_gradient 
function_}} found in 
{{_src/test/scripts/functions/paramserv/mnist_lenet_paramserv.dml_}}. The 
problem can then be reproduced by launching the method 
{{_testParamservBSPBatchDisjointContiguous_}} inside the class 
_{{org.apache.sysml.test.integration.functions.paramserv.ParamservLocalNNTest}}_

Here is the stack:
{code:java}
18/07/31 22:10:27 INFO mapred.MapTask: numReduceTasks: 1
18/07/31 22:10:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/07/31 22:10:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/07/31 22:10:27 INFO mapred.MapTask: soft limit at 83886080
18/07/31 22:10:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/07/31 22:10:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/07/31 22:10:27 INFO mapreduce.Job: The url to track the job: 
http://localhost:8080/
18/07/31 22:10:27 INFO mapreduce.Job: Running job: job_local792652629_0008
{code}

[~mboehm7], if possible, could you take a look at this? I've double-checked the 
creation of the execution context in {{ParamservBuiltinCPInstruction}}: it is 
an instance of ExecutionContext, not SparkExecutionContext.


  was:
When trying to use scalar casting to get element from a list, unexpected 
mapreduce tasks are launched instead of CP mode. The scenario is to replace *C 
= 1* with *C = as.scalar(hyperparams["C"])* inside the {{_gradient function_}} 
found in {{_src/test/scripts/functions/paramserv/mnist_lenet_paramserv.dml_}}. 
And then the problem could be reproduced by launching the method 
{{_testParamservBSPBatchDisjointContiguous_}} inside class 
_{{org.apache.sysml.test.integration.functions.paramserv.ParamservLocalNNTest}}_

Here is the stack:
{code:java}
18/07/31 22:10:27 INFO mapred.MapTask: numReduceTasks: 1
18/07/31 22:10:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/07/31 22:10:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/07/31 22:10:27 INFO mapred.MapTask: soft limit at 83886080
18/07/31 22:10:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/07/31 22:10:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/07/31 22:10:27 INFO mapreduce.Job: The url to track the job: 
http://localhost:8080/
18/07/31 22:10:27 INFO mapreduce.Job: Running job: job_local792652629_0008
{code}



> Unexpected mapreduce task
> -
>
> Key: SYSTEMML-2476
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2476
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Priority: Major
>
> When trying to use scalar casting to get element from a list, unexpected 
> mapreduce tasks are launched instead of CP mode. The scenario is to replace 
> *C = 1* with *C = as.scalar(hyperparams["C"])* inside the {{_gradient 
> function_}} found in 
> {{_src/test/scripts/functions/paramserv/mnist_lenet_paramserv.dml_}}. And 
> then the problem could be reproduced by launching the method 
> {{_testParamservBSPBatchDisjointContiguous_}} inside class 
> _{{org.apache.sysml.test.integration.functions.paramserv.ParamservLocalNNTest}}_
> Here is the stack:
> {code:java}
> 18/07/31 22:10:27 INFO mapred.MapTask: numReduceTasks: 1
> 18/07/31 22:10:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 18/07/31 22:10:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 18/07/31 22:10:27 INFO mapred.MapTask: soft limit at 83886080
> 18/07/31 22:10:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 18/07/31 22:10:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 18/07/31 22:10:27 INFO mapreduce.Job: The url to track the job: 
> http://localhost:8080/
> 18/07/31 22:10:27 INFO mapreduce.Job: Running job: job_local792652629_0008
> {code}
> [~mboehm7], if possible, could you take a look at this? I've double-checked 
> the creation of the execution context in {{ParamservBuiltinCPInstruction}}: 
> it is an instance of ExecutionContext, not SparkExecutionContext.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SYSTEMML-2476) Unexpected mapreduce task

2018-07-31 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2476:
---

 Summary: Unexpected mapreduce task
 Key: SYSTEMML-2476
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2476
 Project: SystemML
  Issue Type: Bug
Reporter: LI Guobao


When trying to use scalar casting to get an element from a list, unexpected 
MapReduce tasks are launched instead of running in CP mode. The scenario is to 
replace *C = 1* with *C = as.scalar(hyperparams["C"])* inside the {{_gradient 
function_}} found in 
{{_src/test/scripts/functions/paramserv/mnist_lenet_paramserv.dml_}}. The 
problem can then be reproduced by launching the method 
{{_testParamservBSPBatchDisjointContiguous_}} inside the class 
_{{org.apache.sysml.test.integration.functions.paramserv.ParamservLocalNNTest}}_

Here is the stack:
{code:java}
18/07/31 22:10:27 INFO mapred.MapTask: numReduceTasks: 1
18/07/31 22:10:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/07/31 22:10:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/07/31 22:10:27 INFO mapred.MapTask: soft limit at 83886080
18/07/31 22:10:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/07/31 22:10:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/07/31 22:10:27 INFO mapreduce.Job: The url to track the job: 
http://localhost:8080/
18/07/31 22:10:27 INFO mapreduce.Job: Running job: job_local792652629_0008
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SYSTEMML-2469) Large distributed paramserv overheads

2018-07-31 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2469.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Large distributed paramserv overheads
> -
>
> Key: SYSTEMML-2469
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
> Project: SystemML
>  Issue Type: Bug
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> Initial runs with the distributed paramserv implementation on a small cluster 
> revealed that it is working correctly while exhibiting large overheads. Below 
> are the stats for mnist lenet, 10 epochs, ASP, update per EPOCH, on a cluster 
> of 1+6 nodes (24 cores per worker node). 
> {code}
> Total elapsed time: 687.743 sec.
> Total compilation time: 3.815 sec.
> Total execution time:   683.928 sec.
> Number of compiled Spark inst:  330.
> Number of executed Spark inst:  0.
> Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
> Cache writes (WB, FS, HDFS):29856/5271/0.
> Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/1629.
> HOP DAGs recompile time:4.878 sec.
> Functions recompiled:   1.
> Functions recompile time:   0.097 sec.
> Spark ctx create time (lazy):   22.222 sec.
> Spark trans counts (par,bc,col):2/1/0.
> Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
> Paramserv total num workers:144.
> Paramserv setup time:   68.259 secs.
> Paramserv grad compute time:6952.163 secs.
> Paramserv model update time:2453.448/422.955 secs.
> Paramserv model broadcast time: 24.982 secs.
> Paramserv batch slice time: 0.204 secs.
> Paramserv RPC request time: 51611.210 secs.
> ParFor loops optimized: 1.
> ParFor optimize time:   0.462 sec.
> ParFor initialize time: 0.049 sec.
> ParFor result merge time:   0.028 sec.
> ParFor total update in-place:   0/188/188
> Total JIT compile time: 98.786 sec.
> Total JVM GC count: 68.
> Total JVM GC time:  25.858 sec.
> Heavy hitter instructions:
>   #  Instruction  Time(s)  Count
>   1  paramserv665.479  1
>   2  +182.410  18636
>   3  conv2d_bias_add  150.938376
>   4  sqrt  69.768  11528
>   5  / 54.836  11732
>   6  ba+*  45.901376
>   7  * 38.046  11727
>   8  - 37.428  12096
>   9  ^235.533   6344
>  10  exp   21.022188
> {code}
> There seem to be three distinct issues:
> * Too large a number of tasks when assembling the distributed input data 
> (proportional to the number of rows, i.e., >50,000 tasks), which makes the 
> distributed data partitioning very slow (multiple minutes).
> * Evictions from the buffer pool at the driver node (see cache writes). This 
> is likely due to disabling cleanup (and missing explicit cleanup) of all RPC 
> objects.
> * Large RPC overhead: This might be due to the evictions happening in the 
> critical path and all 144 workers waiting with their RPC requests. However, 
> in addition we should also double check that the number of RPC handler 
> threads is correct, if we could get the serialization and communication out 
> of the critical (i.e., synchronized) path of model updates, and address 
> unnecessary serialization/deserialization overheads.
> [~Guobao] I'll help reduce the serialization/deserialization overheads, but 
> it would be great if you could have a look at the other issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SYSTEMML-2471) Add java doc

2018-07-31 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2471.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Add java doc
> 
>
> Key: SYSTEMML-2471
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2471
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (SYSTEMML-2424) Determine the level of par

2018-07-30 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao closed SYSTEMML-2424.
---
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Determine the level of par
> --
>
> Key: SYSTEMML-2424
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2424
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> It aims to determine the parallelism level according to the cluster 
> resources, i.e., the total number of vcores.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SYSTEMML-2471) Add java doc

2018-07-30 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2471:
---

 Summary: Add java doc
 Key: SYSTEMML-2471
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2471
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SYSTEMML-2087) Initial version of distributed spark backend

2018-07-30 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2087.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Initial version of distributed spark backend
> 
>
> Key: SYSTEMML-2087
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> This part aims to implement the parameter server for the Spark distributed 
> backend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SYSTEMML-2420) Communication between ps and workers

2018-07-30 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2420.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
> Attachments: systemml_rpc_2_seq_diagram.png, 
> systemml_rpc_class_diagram.png, systemml_rpc_sequence_diagram.png
>
>
> It aims to implement the parameter exchange between ps and workers. We could 
> leverage the netty framework to implement our own RPC framework. In general, 
> the netty {{TransportClient}} and {{TransportServer}} provide the sending and 
> receiving services for ps and workers. Extending the {{RpcHandler}} allows 
> invoking the corresponding ps method (i.e., the push/pull method) by handling 
> the different input RPC call objects. The {{SparkPsProxy}} wrapping 
> {{TransportClient}} then allows the workers to execute push/pull calls to the 
> server. At the same time, the ps netty server also provides a file repository 
> service which allows the workers to download the partitioned training data, 
> so that the workers can rebuild the matrix object from the transferred file 
> instead of broadcasting all the files with Spark, since not all of them are 
> needed by each worker.
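
The push/pull protocol described above can be sketched as a toy. This is 
illustrative only, not the actual netty-based SystemML implementation; the 
class name {{ParamServerSketch}} and its methods are hypothetical stand-ins 
for the real {{RpcHandler}} dispatch:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy sketch of a parameter server's push/pull dispatch. The real SystemML
// code uses netty TransportClient/TransportServer plus an RpcHandler; this
// stand-in only illustrates the protocol: workers push gradients, the server
// applies them to the shared model under synchronization, and workers pull
// the updated model back.
class ParamServerSketch {
    private final Map<Integer, double[]> model = new HashMap<>();

    // Worker -> server: apply a gradient to the parameter with the given id.
    public synchronized void push(int paramId, double[] gradient) {
        double[] p = model.computeIfAbsent(paramId, k -> new double[gradient.length]);
        for (int i = 0; i < gradient.length; i++)
            p[i] -= gradient[i]; // plain SGD step (learning rate folded into gradient)
    }

    // Server -> worker: return a copy of the current parameter values.
    public synchronized double[] pull(int paramId) {
        return model.get(paramId).clone();
    }
}
{code}

The synchronized push is exactly the "critical path of model updates" the 
issue above worries about: serialization and communication should stay outside 
it.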



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (SYSTEMML-2469) Large distributed paramserv overheads

2018-07-28 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao reassigned SYSTEMML-2469:
---

Assignee: LI Guobao

> Large distributed paramserv overheads
> -
>
> Key: SYSTEMML-2469
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
> Project: SystemML
>  Issue Type: Bug
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> Initial runs with the distributed paramserv implementation on a small cluster 
> revealed that it is working correctly while exhibiting large overheads. Below 
> are the stats for mnist lenet, 10 epochs, ASP, update per EPOCH, on a cluster 
> of 1+6 nodes (24 cores per worker node). 
> {code}
> Total elapsed time: 687.743 sec.
> Total compilation time: 3.815 sec.
> Total execution time:   683.928 sec.
> Number of compiled Spark inst:  330.
> Number of executed Spark inst:  0.
> Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
> Cache writes (WB, FS, HDFS):29856/5271/0.
> Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/1629.
> HOP DAGs recompile time:4.878 sec.
> Functions recompiled:   1.
> Functions recompile time:   0.097 sec.
> Spark ctx create time (lazy):   22.222 sec.
> Spark trans counts (par,bc,col):2/1/0.
> Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
> Paramserv total num workers:144.
> Paramserv setup time:   68.259 secs.
> Paramserv grad compute time:6952.163 secs.
> Paramserv model update time:2453.448/422.955 secs.
> Paramserv model broadcast time: 24.982 secs.
> Paramserv batch slice time: 0.204 secs.
> Paramserv RPC request time: 51611.210 secs.
> ParFor loops optimized: 1.
> ParFor optimize time:   0.462 sec.
> ParFor initialize time: 0.049 sec.
> ParFor result merge time:   0.028 sec.
> ParFor total update in-place:   0/188/188
> Total JIT compile time: 98.786 sec.
> Total JVM GC count: 68.
> Total JVM GC time:  25.858 sec.
> Heavy hitter instructions:
>   #  Instruction  Time(s)  Count
>   1  paramserv665.479  1
>   2  +182.410  18636
>   3  conv2d_bias_add  150.938376
>   4  sqrt  69.768  11528
>   5  / 54.836  11732
>   6  ba+*  45.901376
>   7  * 38.046  11727
>   8  - 37.428  12096
>   9  ^235.533   6344
>  10  exp   21.022188
> {code}
> There seem to be three distinct issues:
> * Too large a number of tasks when assembling the distributed input data 
> (proportional to the number of rows, i.e., >50,000 tasks), which makes the 
> distributed data partitioning very slow (multiple minutes).
> * Evictions from the buffer pool at the driver node (see cache writes). This 
> is likely due to disabling cleanup (and missing explicit cleanup) of all RPC 
> objects.
> * Large RPC overhead: This might be due to the evictions happening in the 
> critical path and all 144 workers waiting with their RPC requests. However, 
> in addition we should also double check that the number of RPC handler 
> threads is correct, if we could get the serialization and communication out 
> of the critical (i.e., synchronized) path of model updates, and address 
> unnecessary serialization/deserialization overheads.
> [~Guobao] I'll help reduce the serialization/deserialization overheads, but 
> it would be great if you could have a look at the other issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1

2018-07-28 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2466.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Distributed paramserv fails on newer Spark version > 2.1
> 
>
> Key: SYSTEMML-2466
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2466
> Project: SystemML
>  Issue Type: Task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/network/util/SystemPropertyConfigProvider
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.runOnSpark(ParamservBuiltinCPInstruction.java:163)
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:113)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> at 
> org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> at 
> org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:106)
> at org.apache.sysml.api.DMLScript.execute(DMLScript.java:487)
> at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:272)
> at org.apache.sysml.api.DMLScript.main(DMLScript.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.util.SystemPropertyConfigProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1

2018-07-26 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558420#comment-16558420
 ] 

LI Guobao edited comment on SYSTEMML-2466 at 7/26/18 3:23 PM:
--

Hi [~mboehm7], I pushed the modification for this issue in my latest PR. Is 
that appropriate for you? Or do I need to push it in another separate PR, 
without the commits of the deep serialization, in order to test the 
performance difference? Just let me know.


was (Author: guobao):
Hi [~mboehm7], I pushed the modification for this issue in my latest PR. Is 
that appropriate for you? Or do I need to push it in another separate PR 
without the commits of the deep serialization?

> Distributed paramserv fails on newer Spark version > 2.1
> 
>
> Key: SYSTEMML-2466
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2466
> Project: SystemML
>  Issue Type: Task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/network/util/SystemPropertyConfigProvider
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.runOnSpark(ParamservBuiltinCPInstruction.java:163)
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:113)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> at 
> org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> at 
> org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:106)
> at org.apache.sysml.api.DMLScript.execute(DMLScript.java:487)
> at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:272)
> at org.apache.sysml.api.DMLScript.main(DMLScript.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.util.SystemPropertyConfigProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1

2018-07-26 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558420#comment-16558420
 ] 

LI Guobao commented on SYSTEMML-2466:
-

Hi [~mboehm7], I pushed the modification for this issue in my latest PR. Is 
that appropriate for you? Or do I need to push it in another separate PR 
without the commits of the deep serialization?

> Distributed paramserv fails on newer Spark version > 2.1
> 
>
> Key: SYSTEMML-2466
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2466
> Project: SystemML
>  Issue Type: Task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/network/util/SystemPropertyConfigProvider
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.runOnSpark(ParamservBuiltinCPInstruction.java:163)
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:113)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> at 
> org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> at 
> org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:106)
> at org.apache.sysml.api.DMLScript.execute(DMLScript.java:487)
> at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:272)
> at org.apache.sysml.api.DMLScript.main(DMLScript.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.util.SystemPropertyConfigProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1

2018-07-26 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao reassigned SYSTEMML-2466:
---

Assignee: LI Guobao

> Distributed paramserv fails on newer Spark version > 2.1
> 
>
> Key: SYSTEMML-2466
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2466
> Project: SystemML
>  Issue Type: Task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/spark/network/util/SystemPropertyConfigProvider
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.runOnSpark(ParamservBuiltinCPInstruction.java:163)
> at 
> org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:113)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> at 
> org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> at 
> org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:106)
> at org.apache.sysml.api.DMLScript.execute(DMLScript.java:487)
> at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:272)
> at org.apache.sysml.api.DMLScript.main(DMLScript.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.util.SystemPropertyConfigProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SYSTEMML-2465) Keep data consistency for a pre-trained model

2018-07-25 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2465:
---

 Summary: Keep data consistency for a pre-trained model
 Key: SYSTEMML-2465
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2465
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao


In the distributed Spark backend, passing a pre-trained model to the paramserv 
function may cause data inconsistency, because the pre-trained model is cached 
in the driver's memory. When the paramserv function is kicked off, the workers 
first try to read the data from HDFS, where the dirty data of the pre-trained 
model has not yet been persisted. This leads to an inconsistency. The idea is 
therefore to export the dirty data to HDFS before launching the remote workers.
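The write-back protocol described above can be illustrated with a minimal, 
self-contained sketch (plain Python with hypothetical names such as 
`CachedMatrix` and `launch_remote_workers`; SystemML's actual buffer-pool and 
Spark APIs differ):

```python
# A cache entry is "dirty" when it holds in-memory (driver-side) changes that
# have not yet been persisted to shared storage (HDFS). Before remote workers
# are launched, every dirty entry must be exported, otherwise workers reading
# from storage would see stale or missing data.

class CachedMatrix:
    def __init__(self, name, data):
        self.name = name
        self.data = data        # in-memory (driver-side) value
        self.persisted = None   # value visible on shared storage
        self.dirty = True       # in-memory changes not yet exported

    def export(self):
        """Persist the in-memory value so remote readers see it."""
        self.persisted = self.data
        self.dirty = False

def launch_remote_workers(model):
    # Export dirty data first, mirroring the fix described above.
    for m in model:
        if m.dirty:
            m.export()
    # Workers read from "storage", i.e. the persisted copies.
    return [m.persisted for m in model]

model = [CachedMatrix("W1", [[0.1, 0.2]]), CachedMatrix("b1", [[0.0]])]
seen_by_workers = launch_remote_workers(model)
assert seen_by_workers == [[[0.1, 0.2]], [[0.0]]]
```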





[jira] [Assigned] (SYSTEMML-2090) Documentation of language extension

2018-07-25 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao reassigned SYSTEMML-2090:
---

Assignee: LI Guobao

> Documentation of language extension
> ---
>
> Key: SYSTEMML-2090
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2090
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>






[jira] [Resolved] (SYSTEMML-2423) Implementation of spark ps

2018-07-25 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2423.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Implementation of spark ps
> --
>
> Key: SYSTEMML-2423
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2423
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Updated] (SYSTEMML-2457) Error handling and add statistics for spark backend

2018-07-25 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2457:

Summary: Error handling and add statistics for spark backend  (was: Error 
handling and add statistic for spark backend)

> Error handling and add statistics for spark backend
> ---
>
> Key: SYSTEMML-2457
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2457
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Commented] (SYSTEMML-2458) Add experiment on spark paramserv

2018-07-24 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554752#comment-16554752
 ] 

LI Guobao commented on SYSTEMML-2458:
-

[~mboehm7], I've pushed the scripts for the distributed Spark experiments. 
Could you please take a look at them?

> Add experiment on spark paramserv
> -
>
> Key: SYSTEMML-2458
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>






[jira] [Closed] (SYSTEMML-2414) Paramserv zero accuracy with Overlap_Reshuffle

2018-07-22 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao closed SYSTEMML-2414.
---

> Paramserv zero accuracy with Overlap_Reshuffle
> --
>
> Key: SYSTEMML-2414
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2414
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Resolved] (SYSTEMML-2422) Implementation of remote worker

2018-07-22 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2422.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Implementation of remote worker
> ---
>
> Key: SYSTEMML-2422
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2422
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Created] (SYSTEMML-2458) Add experiment on spark paramserv

2018-07-19 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2458:
---

 Summary: Add experiment on spark paramserv
 Key: SYSTEMML-2458
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao








[jira] [Resolved] (SYSTEMML-2443) Add experiments varied on optimizers

2018-07-19 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2443.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Add experiments varied on optimizers
> 
>
> Key: SYSTEMML-2443
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2443
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> It aims to add scripts for running the local parameter server experiments 
> and to explore the training results with the different optimizers (sgd, 
> adagrad, adam).
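For reference, the three optimizers compared in these experiments differ only 
in their parameter-update rule. The following sketch shows the standard 
textbook updates in plain Python (illustrative function names; this is not 
SystemML's nn/optim library code):

```python
import math

def sgd(w, g, lr=0.1):
    # Vanilla SGD: step against the gradient at a fixed learning rate.
    return w - lr * g

def adagrad(w, g, state, lr=0.1, eps=1e-8):
    # AdaGrad: scale the step by the accumulated squared gradients.
    state["G"] = state.get("G", 0.0) + g * g
    return w - lr * g / (math.sqrt(state["G"]) + eps)

def adam(w, g, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first and second moment estimates.
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# One step on the same weight/gradient shows how the step sizes differ.
w, g = 1.0, 0.5
print(sgd(w, g))          # 0.95
print(adagrad(w, g, {}))  # ~0.9 (first step is roughly lr in magnitude)
print(adam(w, g, {}))     # ~0.9 (bias correction makes the first step ~lr)
```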





[jira] [Created] (SYSTEMML-2457) Error handling and add statistic for spark backend

2018-07-19 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2457:
---

 Summary: Error handling and add statistic for spark backend
 Key: SYSTEMML-2457
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2457
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao








[jira] [Resolved] (SYSTEMML-2457) Error handling and add statistic for spark backend

2018-07-19 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2457.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Error handling and add statistic for spark backend
> --
>
> Key: SYSTEMML-2457
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2457
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Resolved] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-18 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2419.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> In the context of the distributed Spark environment, we first need to ship 
> the necessary functions and variables to the remote workers, and then 
> initialize and register the cleanup of the buffer pool for each remote 
> worker. All of this is inspired by the parfor implementation.





[jira] [Assigned] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing

2018-07-17 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao reassigned SYSTEMML-2446:
---

Assignee: LI Guobao

> Paramserv adagrad ASP batch disjoint_continuous failing
> ---
>
> Key: SYSTEMML-2446
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2446
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> {code}
> Caused by: java.io.IOException: File 
> scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on 
> HDFS/LFS.
> at 
> org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120)
> at 
> org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434)
> {code}





[jira] [Comment Edited] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing

2018-07-17 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546134#comment-16546134
 ] 

LI Guobao edited comment on SYSTEMML-2446 at 7/17/18 7:23 AM:
--

[~mboehm7] sure.


was (Author: guobao):
[~mboehm7]sure.

> Paramserv adagrad ASP batch disjoint_continuous failing
> ---
>
> Key: SYSTEMML-2446
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2446
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> {code}
> Caused by: java.io.IOException: File 
> scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on 
> HDFS/LFS.
> at 
> org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120)
> at 
> org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434)
> {code}





[jira] [Commented] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing

2018-07-17 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546134#comment-16546134
 ] 

LI Guobao commented on SYSTEMML-2446:
-

[~mboehm7]sure.

> Paramserv adagrad ASP batch disjoint_continuous failing
> ---
>
> Key: SYSTEMML-2446
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2446
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> {code}
> Caused by: java.io.IOException: File 
> scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on 
> HDFS/LFS.
> at 
> org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120)
> at 
> org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197)
> at 
> org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434)
> at 
> org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886)
> at 
> org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434)
> {code}





[jira] [Commented] (SYSTEMML-2443) Add experiments varied on optimizers

2018-07-14 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544369#comment-16544369
 ] 

LI Guobao commented on SYSTEMML-2443:
-

Hi [~mboehm7], I'm currently running the local parameter server experiments 
varied on optimizers, and I have already pushed the new scripts to GitHub. In 
the meantime, I first ran them on my laptop with the 60k MNIST dataset, but 
there does not seem to be much difference in model precision. Hence, I wonder 
whether we should apply them to some other typical training data instead of 
MNIST. For example, is there any dataset that is relatively harder to train on?

> Add experiments varied on optimizers
> 
>
> Key: SYSTEMML-2443
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2443
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to add scripts for running the local parameter server experiments 
> and to explore the training results with the different optimizers (sgd, 
> adagrad, adam).





[jira] [Updated] (SYSTEMML-2443) Add experiments varied on optimizers

2018-07-14 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2443:

Summary: Add experiments varied on optimizers  (was: Add experiment varied 
on optimizer)

> Add experiments varied on optimizers
> 
>
> Key: SYSTEMML-2443
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2443
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to add scripts for running the local parameter server experiments 
> and to explore the differences between the optimizers (sgd, adagrad, adam).





[jira] [Updated] (SYSTEMML-2443) Add experiments varied on optimizers

2018-07-14 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2443:

Description: It aims to add the scripts of doing the experiments for local 
ps and to explore the training result with the different optimizers (sgd, 
adagrad, adam).  (was: It aims to add the scripts of doing the experiments for 
local ps and to explore the difference on the optimizers (sgd, adagrad, adam).)

> Add experiments varied on optimizers
> 
>
> Key: SYSTEMML-2443
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2443
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to add scripts for running the local parameter server experiments 
> and to explore the training results with the different optimizers (sgd, 
> adagrad, adam).





[jira] [Closed] (SYSTEMML-2301) Second evaluation

2018-07-14 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao closed SYSTEMML-2301.
---
   Resolution: Fixed
 Assignee: LI Guobao
Fix Version/s: SystemML 1.2

> Second evaluation
> -
>
> Key: SYSTEMML-2301
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2301
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Created] (SYSTEMML-2443) Add experiment varied on optimizer

2018-07-14 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2443:
---

 Summary: Add experiment varied on optimizer
 Key: SYSTEMML-2443
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2443
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao


It aims to add scripts for running the local parameter server experiments and 
to explore the differences between the optimizers (sgd, adagrad, adam).





[jira] [Created] (SYSTEMML-2440) Got zero when casting an element of list

2018-07-13 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2440:
---

 Summary: Got zero when casting an element of list
 Key: SYSTEMML-2440
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2440
 Project: SystemML
  Issue Type: Bug
Reporter: LI Guobao


While running the paramserv experiments, I tried to get an element of a list 
inside a function and got a zero value.
{code:java}
stride = as.integer(as.scalar(hyperparams["stride"]))
pad = as.integer(as.scalar(hyperparams["pad"]))
lambda = as.double(as.scalar(hyperparams["lambda"])){code}
{code:java}
Caused by: java.lang.RuntimeException: Incorrect parameters: height=0 filter_height=0 stride=0 pad=0
at org.apache.sysml.runtime.util.DnnUtils.getP(DnnUtils.java:43)
at org.apache.sysml.runtime.instructions.cp.DnnCPInstruction.processInstruction(DnnCPInstruction.java:457)
at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
... 12 more
{code}





[jira] [Resolved] (SYSTEMML-2414) Paramserv zero accuracy with Overlap_Reshuffle

2018-07-13 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2414.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Paramserv zero accuracy with Overlap_Reshuffle
> --
>
> Key: SYSTEMML-2414
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2414
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>






[jira] [Closed] (SYSTEMML-2412) Paramserv "all the same accuracy" problem

2018-07-13 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao closed SYSTEMML-2412.
---

> Paramserv "all the same accuracy" problem
> -
>
> Key: SYSTEMML-2412
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2412
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> We came across the problem that all the model accuracies are the same. One 
> suspected bug is that the batch size in the validation method is inconsistent.





[jira] [Closed] (SYSTEMML-2403) Paramserv low accuracy sometimes occurred

2018-07-13 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao closed SYSTEMML-2403.
---

> Paramserv low accuracy sometimes occurred
> -
>
> Key: SYSTEMML-2403
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2403
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> We observed that low accuracy sometimes occurs. Here is the scenario: _BSP, 
> BATCH, DISJOINT_CONTIGUOUS (or DISJOINT_RANDOM)_ with _1 epoch, 4 workers (I 
> have 4 vcores in my machine) and batchSize 16_, using the 60k MNIST dataset.
> {code:java}
> Val Loss: 2.3006845853187783
> Val Accuracy: 0.11184
> {code}





[jira] [Resolved] (SYSTEMML-2412) Paramserv "all the same accuracy" problem

2018-07-13 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2412.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Paramserv "all the same accuracy" problem
> -
>
> Key: SYSTEMML-2412
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2412
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> We came across the problem that all the model accuracies are the same. One 
> suspected bug is that the batch size in the validation method is inconsistent.





[jira] [Resolved] (SYSTEMML-2403) Paramserv low accuracy sometimes occurred

2018-07-13 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2403.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Paramserv low accuracy sometimes occurred
> -
>
> Key: SYSTEMML-2403
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2403
> Project: SystemML
>  Issue Type: Bug
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> We observed that low accuracy sometimes occurs. Here is the scenario: _BSP, 
> BATCH, DISJOINT_CONTIGUOUS (or DISJOINT_RANDOM)_ with _1 epoch, 4 workers (I 
> have 4 vcores in my machine) and batchSize 16_, using the 60k MNIST dataset.
> {code:java}
> Val Loss: 2.3006845853187783
> Val Accuracy: 0.11184
> {code}





[jira] [Resolved] (SYSTEMML-2418) Spark data partitioner

2018-07-13 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2418.
-
   Resolution: Fixed
Fix Version/s: SystemML 1.2

> Spark data partitioner
> --
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> In the context of ML, it would be more efficient to support data partitioning 
> in a distributed manner. This task aims to perform the data partitioning on 
> Spark: all the data is first split among the workers, each worker then 
> partitions its share according to the chosen scheme, and the partitioned data 
> staying on each worker can be passed directly to the model training work 
> without materialization on HDFS.
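The simplest of the partition schemes used here, disjoint_contiguous, can be 
sketched as follows (plain Python for illustration, not SystemML's Spark 
implementation; the function name is hypothetical): N rows are split into k 
contiguous, non-overlapping slices, one per worker.

```python
def disjoint_contiguous(num_rows, k):
    """Return (start, end) row ranges, end exclusive, one per worker."""
    base, rem = divmod(num_rows, k)
    ranges, start = [], 0
    for i in range(k):
        size = base + (1 if i < rem else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges

print(disjoint_contiguous(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each worker then trains only on its own row range, so no shuffle or HDFS 
round-trip is needed once the split is done.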





[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-13 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835
 ] 

LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:21 AM:
---

[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is about the codegen class. Aiming to avoid the 
concurrently and redundantly reloading of the class, what will the codegen 
class concretely be? Because when parsing the parfor body, it will generate a 
codegen class map.


was (Author: guobao):
[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is about the codegen class. Aiming to avoid the 
concurrently reloading of the class, what will the codegen class concretely be? 
Because when parsing the parfor body, it will generate a codegen class map.

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of the distributed Spark environment, we first need to ship 
> the necessary functions and variables to the remote workers, and then 
> initialize and register the cleanup of the buffer pool for each remote 
> worker. All of this is inspired by the parfor implementation.





[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-13 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835
 ] 

LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:21 AM:
---

[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is about the codegen class. Aiming to avoid the 
concurrently and redundantly loading of the class, what will the codegen class 
concretely be? Because when parsing the parfor body, it will generate a codegen 
class map.


was (Author: guobao):
[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is about the codegen class. Aiming to avoid the 
concurrently and redundantly reloading of the class, what will the codegen 
class concretely be? Because when parsing the parfor body, it will generate a 
codegen class map.

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of the distributed Spark environment, we first need to ship 
> the necessary functions and variables to the remote workers, and then 
> initialize and register the cleanup of the buffer pool for each remote 
> worker. All of this is inspired by the parfor implementation.





[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-13 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835
 ] 

LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:20 AM:
---

[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is about the codegen class. Aiming to avoid the 
concurrently reloading of the class, what will the codegen class concretely be? 
Because when parsing the parfor body, it will generate a codegen class map.


was (Author: guobao):
[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is, what is the codegen class concretely is and 
used for? Because when parsing the parfor body, it will generate a codegen 
class map.

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of the distributed Spark environment, we first need to ship 
> the necessary functions and variables to the remote workers, and then 
> initialize and register the cleanup of the buffer pool for each remote 
> worker. All of this is inspired by the parfor implementation.





[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-13 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835
 ] 

LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:18 AM:
---

[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is, what is the codegen class concretely is and 
used for? Because when parsing the parfor body, it will generate a codegen 
class map.


was (Author: guobao):
[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is, what is the codegen class used for? Because 
when parsing the parfor body, it will generate a codegen class map.

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of the distributed Spark environment, we first need to ship 
> the necessary functions and variables to the remote workers, and then 
> initialize and register the cleanup of the buffer pool for each remote 
> worker. All of this is inspired by the parfor implementation.





[jira] [Commented] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-13 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835
 ] 

LI Guobao commented on SYSTEMML-2419:
-

[~mboehm7], I have some questions. The first is about the setup of remote 
parfor worker. In fact, I saw that this block of code is synchronized and so I 
wonder if it means that in one executor, the parfor task will be launched in 
multi-threaded way? The second is, what is the codegen class used for? Because 
when parsing the parfor body, it will generate a codegen class map.

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of the distributed Spark environment, we first need to ship 
> the necessary functions and variables to the remote workers, and then 
> initialize and register the cleanup of the buffer pool for each remote 
> worker. All of this is inspired by the parfor implementation.





[jira] [Commented] (SYSTEMML-2299) API design of the paramserv function

2018-07-11 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540715#comment-16540715
 ] 

LI Guobao commented on SYSTEMML-2299:
-

[~mboehm7], so far it seems that the arguments "val_features" and "val_labels" 
are not used inside the paramserv function. As I understand it, they would be 
used to compute the model accuracy, but we do that with a UDF outside the 
paramserv function. Do they serve some other purpose?

> API design of the paramserv function
> 
>
> Key: SYSTEMML-2299
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2299
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Fix For: SystemML 1.2
>
>
> The objective of the “paramserv” built-in function is to update an initial or 
> existing model with the given configuration. An initial function signature 
> would be: 
> {code:java}
> model'=paramserv(model=paramsList, features=X, labels=Y, val_features=X_val, 
> val_labels=Y_val, upd="fun1", agg="fun2", mode="LOCAL", utype="BSP", 
> freq="BATCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous", 
> hyperparams=params, checkpointing="NONE"){code}
> We are interested in providing the model (a struct-like data structure 
> consisting of the weights, the biases and the hyperparameters), the training 
> features and labels, the validation features and labels, the batch update 
> function (i.e., the gradient calculation function), the update strategy (e.g., 
> sync, async, hogwild!, stale-synchronous), the update frequency (e.g., epoch 
> or mini-batch), the gradient aggregation function, the number of epochs, the 
> batch size, the degree of parallelism, the data partitioning scheme, a list of 
> additional hyperparameters, as well as the checkpointing strategy. The 
> function returns a trained model in struct format.
> *Inputs*:
>  * model : a list consisting of the weight and bias matrices
>  * features : training features matrix
>  * labels : training label matrix
>  * val_features : validation features matrix
>  * val_labels : validation label matrix
>  * upd : the name of gradient calculation function
>  * agg : the name of gradient aggregation function
>  * mode  (options: LOCAL, REMOTE_SPARK): the execution backend where 
> the parameter server is executed
>  * utype  (options: BSP, ASP, SSP): the update strategy
>  * freq  [optional] (default: BATCH) (options: EPOCH, BATCH) : the 
> frequency of updates
>  * epochs : the number of epochs
>  * batchsize  [optional] (default: 64): the batch size; if the update 
> frequency is "EPOCH", this argument is ignored
>  * k  [optional] (default: number of vcores, or vcores / 2 if 
> using openblas): the degree of parallelism
>  * scheme  [optional] (default: disjoint_contiguous) (options: 
> disjoint_contiguous, disjoint_round_robin, disjoint_random, 
> overlap_reshuffle): the data partitioning scheme, i.e., how the data is 
> distributed across workers
>  * hyperparams  [optional]: a list of additional hyperparameters, 
> e.g., learning rate, momentum
>  * checkpointing [optional] (default: NONE) (options: NONE, EPOCH, 
> EPOCH10) : the checkpointing strategy; a checkpoint can be taken every epoch 
> or every 10 epochs 
> *Output*:
>  * model' : a list consisting of the updated weight and bias matrices



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
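The utype options above (BSP vs. ASP) differ in when the server applies worker
gradients. As a rough illustration of that difference only, here is a toy
Python sketch (not SystemML code; the scalar "model", the quadratic loss, and
all function names are hypothetical): a BSP server lets all k workers compute
gradients against the same model snapshot and performs one aggregated update
per superstep, while an ASP server applies each gradient immediately, so later
workers already see the updated model.

```python
def grad(model):
    # Hypothetical gradient of the loss 0.5 * (model - 3)^2,
    # i.e., each update pulls the model toward 3.0.
    return -(model - 3.0)

def bsp_step(model, k, lr=0.1):
    # BSP: barrier per superstep; all k gradients are computed from the
    # SAME snapshot, then the server applies one aggregated update.
    gs = [grad(model) for _ in range(k)]
    return model + lr * sum(gs) / k

def asp_step(model, k, lr=0.1):
    # ASP: no barrier; each worker's gradient is applied as it arrives,
    # so the next gradient is computed against the updated model.
    for _ in range(k):
        model = model + lr * grad(model)
    return model

m_bsp = m_asp = 0.0
for _ in range(20):
    m_bsp = bsp_step(m_bsp, k=3)
    m_asp = asp_step(m_asp, k=3)
print(round(m_bsp, 3), round(m_asp, 3))  # ASP applies 3x as many updates
```

With identical workers, BSP reduces to one SGD step per superstep, whereas ASP
performs k sequential steps, which is why it progresses faster here; with
heterogeneous workers ASP additionally introduces gradient staleness.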


[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-06 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534950#comment-16534950
 ] 

LI Guobao edited comment on SYSTEMML-2419 at 7/6/18 3:07 PM:
-

[~mboehm7], I have a problem when serializing the instructions: I got some 
Spark instructions that could not be serialized. My question is: should we 
recreate these instructions by forcing the HOPs to CP type? Also, how does 
parfor handle this case, or does it not generate SP instructions at all?
 Here is the stack:
{code:java}
Caused by: org.apache.sysml.runtime.DMLRuntimeException: Not supported: 
Instructions of type other than CP instructions 
org.apache.sysml.runtime.instructions.spark.BinaryMatrixScalarSPInstruction
SPARK°max°0·SCALAR·INT·true°_mVar1279·MATRIX·DOUBLE°_mVar1280·MATRIX·DOUBLE
{code}


was (Author: guobao):
[~mboehm7], I have a problem when serializing the instructions: I got some 
Spark instructions that could not be serialized. My question is: should we 
recreate these instructions by forcing the HOPs to CP type?
Here is the stack:

{code:java}
Caused by: org.apache.sysml.runtime.DMLRuntimeException: Not supported: 
Instructions of type other than CP instructions 
org.apache.sysml.runtime.instructions.spark.BinaryMatrixScalarSPInstruction
SPARK°max°0·SCALAR·INT·true°_mVar1279·MATRIX·DOUBLE°_mVar1280·MATRIX·DOUBLE
{code}


> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of a distributed Spark environment, we first need to ship the 
> necessary functions and variables to the remote workers, and then initialize 
> and register the buffer-pool cleanup for each remote worker. This design is 
> inspired by the parfor implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-06 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534950#comment-16534950
 ] 

LI Guobao commented on SYSTEMML-2419:
-

[~mboehm7], I have a problem when serializing the instructions: I got some 
Spark instructions that could not be serialized. My question is: should we 
recreate these instructions by forcing the HOPs to CP type?
Here is the stack:

{code:java}
Caused by: org.apache.sysml.runtime.DMLRuntimeException: Not supported: 
Instructions of type other than CP instructions 
org.apache.sysml.runtime.instructions.spark.BinaryMatrixScalarSPInstruction
SPARK°max°0·SCALAR·INT·true°_mVar1279·MATRIX·DOUBLE°_mVar1280·MATRIX·DOUBLE
{code}


> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of a distributed Spark environment, we first need to ship the 
> necessary functions and variables to the remote workers, and then initialize 
> and register the buffer-pool cleanup for each remote worker. This design is 
> inspired by the parfor implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-07-02 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2419:

Description: In the context of a distributed Spark environment, we first need 
to ship the necessary functions and variables to the remote workers, and then 
initialize and register the buffer-pool cleanup for each remote worker. This 
design is inspired by the parfor implementation.  (was: In the context of a 
distributed Spark environment, we need to initialize and register the cleanup 
of the buffer pool for each remote worker. It could be inspired by the parfor 
implementation.)

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of a distributed Spark environment, we first need to ship the 
> necessary functions and variables to the remote workers, and then initialize 
> and register the buffer-pool cleanup for each remote worker. This design is 
> inspired by the parfor implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2418) Spark data partitioner

2018-06-28 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2418:

Description: In the context of ML, it is more efficient to support data 
partitioning in a distributed manner. This task aims to do the data 
partitioning on Spark: the data is first split among the workers, each worker 
then partitions its share locally according to the chosen scheme, and the 
partitioned data staying on each worker can be passed directly to the model 
training work without materialization on HDFS.  (was: In the context of ML, it 
is more efficient to support data partitioning in a distributed manner. This 
task aims to do the data partitioning on Spark: the data is first split among 
the workers, each worker then partitions its share locally according to the 
chosen scheme, and the partitioned data staying on each worker can be passed 
directly to the model training work.)

> Spark data partitioner
> --
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of ML, it is more efficient to support data partitioning in a 
> distributed manner. This task aims to do the data partitioning on Spark: the 
> data is first split among the workers, each worker then partitions its share 
> locally according to the chosen scheme, and the partitioned data staying on 
> each worker can be passed directly to the model training work without 
> materialization on HDFS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
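The four partitioning schemes named in the scheme option can be pictured as
simple row-to-worker assignments. Below is a toy Python sketch of that
assignment logic only (all helper names are hypothetical; real SystemML
operates on matrix rows and blocks rather than Python lists, and the exact
randomization differs):

```python
import random

def disjoint_contiguous(rows, k):
    # Worker i gets one contiguous chunk of rows.
    size = (len(rows) + k - 1) // k
    return [rows[i * size:(i + 1) * size] for i in range(k)]

def disjoint_round_robin(rows, k):
    # Row i goes to worker i mod k.
    return [rows[i::k] for i in range(k)]

def disjoint_random(rows, k, seed=42):
    # Random permutation first, then contiguous chunks of the shuffled rows.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return disjoint_contiguous(shuffled, k)

def overlap_reshuffle(rows, k, seed=42):
    # Non-disjoint: every worker sees ALL rows, each in its own random order.
    parts = []
    for i in range(k):
        p = rows[:]
        random.Random(seed + i).shuffle(p)
        parts.append(p)
    return parts

rows = list(range(8))
print(disjoint_contiguous(rows, 3))   # chunks [0..2], [3..5], [6..7]
print(disjoint_round_robin(rows, 3))  # strides 0,3,6 / 1,4,7 / 2,5
```

The three disjoint_* schemes cover each row exactly once across workers;
overlap_reshuffle deliberately duplicates the data so each worker trains on a
reshuffled full copy.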


[jira] [Updated] (SYSTEMML-2418) Spark data partitioner

2018-06-28 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2418:

Description: In the context of ML, it is more efficient to support data 
partitioning in a distributed manner. This task aims to do the data 
partitioning on Spark: the data is first split among the workers, each worker 
then partitions its share locally according to the chosen scheme, and the 
partitioned data staying on each worker can be passed directly to the model 
training work.  (was: In the context of ML, the training data will usually not 
fit in the Spark driver node, so partitioning such enormous data is no longer 
feasible in CP. This task aims to do the data partitioning in a distributed 
way: each worker receives its split of the training data and partitions it 
locally according to the different schemes. All the data is then grouped by 
the given key (i.e., the worker id) and finally written into separate HDFS 
files in the scratch space.)

> Spark data partitioner
> --
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of ML, it is more efficient to support data partitioning in a 
> distributed manner. This task aims to do the data partitioning on Spark: the 
> data is first split among the workers, each worker then partitions its share 
> locally according to the chosen scheme, and the partitioned data staying on 
> each worker can be passed directly to the model training work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SYSTEMML-2418) Spark data partitioner

2018-06-27 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525299#comment-16525299
 ] 

LI Guobao commented on SYSTEMML-2418:
-

[~mboehm7], is my logic correct? Also, I'd like to know whether the scratch 
space is shared by all the remote workers. If so, could the workers load the 
files from this HDFS location?

> Spark data partitioner
> --
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of ML, the training data will usually not fit in the Spark 
> driver node, so partitioning such enormous data is no longer feasible in CP. 
> This task aims to do the data partitioning in a distributed way: each worker 
> receives its split of the training data and partitions it locally according 
> to the different schemes. All the data is then grouped by the given key 
> (i.e., the worker id) and finally written into separate HDFS files in the 
> scratch space.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2418) Spark data partitioner

2018-06-27 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2418:

Description: In the context of ML, the training data will usually not fit in 
the Spark driver node, so partitioning such enormous data is no longer 
feasible in CP. This task aims to do the data partitioning in a distributed 
way: each worker receives its split of the training data and partitions it 
locally according to the different schemes. All the data is then grouped by 
the given key (i.e., the worker id) and finally written into separate HDFS 
files in the scratch space.  (was: In the context of ML, the training data 
will usually not fit in the Spark driver node, so partitioning such enormous 
data is no longer feasible in CP. This task aims to do the data partitioning 
in a distributed way: each worker receives its split of the training data and 
partitions it locally according to the different schemes. All the data is then 
grouped by the given key (i.e., the worker id) and finally written into 
separate HDFS files.)

> Spark data partitioner
> --
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of ML, the training data will usually not fit in the Spark 
> driver node, so partitioning such enormous data is no longer feasible in CP. 
> This task aims to do the data partitioning in a distributed way: each worker 
> receives its split of the training data and partitions it locally according 
> to the different schemes. All the data is then grouped by the given key 
> (i.e., the worker id) and finally written into separate HDFS files in the 
> scratch space.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
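The flow described above (tag each row with its target worker id, shuffle by
that key, then write one file per worker into the scratch space) can be
illustrated without Spark. Here is a toy Python sketch of the
group-by-worker-id-and-write step (a plain dict stands in for the Spark
shuffle, a temp dir for the HDFS scratch space; all names are hypothetical):

```python
import os
import tempfile
from collections import defaultdict

def assign_worker(row_idx, num_workers):
    # Hypothetical disjoint_round_robin assignment: row i -> worker i mod k.
    return row_idx % num_workers

def partition_and_write(rows, num_workers, out_dir):
    # 1) Tag each row with its worker id (the shuffle key in Spark) and group.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[assign_worker(i, num_workers)].append(row)
    # 2) Write one file per worker into the scratch directory.
    paths = {}
    for wid, part in sorted(groups.items()):
        path = os.path.join(out_dir, "worker_%d.csv" % wid)
        with open(path, "w") as f:
            f.writelines(",".join(map(str, r)) + "\n" for r in part)
        paths[wid] = path
    return paths

rows = [[i, i * i] for i in range(6)]
with tempfile.TemporaryDirectory() as d:
    paths = partition_and_write(rows, num_workers=2, out_dir=d)
    print(sorted(paths))  # one scratch file per worker id
```

Each worker then only has to read back its own file to start training, which
is the point of writing per-worker files rather than broadcasting everything.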


[jira] [Updated] (SYSTEMML-2418) Spark data partitioner

2018-06-26 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2418:

Description: In the context of ML, the training data will usually not fit in 
the Spark driver node, so partitioning such enormous data is no longer 
feasible in CP. This task aims to do the data partitioning in a distributed 
way: each worker receives its split of the training data and partitions it 
locally according to the different schemes. All the data is then grouped by 
the given key (i.e., the worker id) and finally written into separate HDFS 
files.  (was: In the context of ps, the training data will be partitioned 
according to the different schemes. This conversion is executed in the driver 
node and the partitioned data should be distributed to the workers via 
broadcast. Due to the 2 GB limitation of Spark broadcast, we could leverage 
the _PartitionedBroadcast_ class for this conversion. Afterwards, the 
partitioned broadcast object can be passed to the workers to launch their 
jobs.)

> Spark data partitioner
> --
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of ML, the training data will usually not fit in the Spark 
> driver node, so partitioning such enormous data is no longer feasible in CP. 
> This task aims to do the data partitioning in a distributed way: each worker 
> receives its split of the training data and partitions it locally according 
> to the different schemes. All the data is then grouped by the given key 
> (i.e., the worker id) and finally written into separate HDFS files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2418) Spark data partitioner

2018-06-26 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2418:

Summary: Spark data partitioner  (was: Distributing data to workers)

> Spark data partitioner
> --
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of ps, the training data will be partitioned according to the 
> different schemes. This conversion is executed in the driver node and the 
> partitioned data should be distributed to the workers via broadcast. Due to 
> the 2 GB limitation of Spark broadcast, we could leverage the 
> _PartitionedBroadcast_ class for this conversion. Afterwards, the partitioned 
> broadcast object can be passed to the workers to launch their jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2421) Task error and preemption handles

2018-06-26 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2421:

Description: It aims to introduce checkpointing to guarantee that a worker can 
recover from a previous failure. In detail, once a worker is brought up, it 
pulls the current state of the model, which includes each worker's progress 
(i.e., which batch iteration and epoch is being executed). The checkpointing 
strategy could be set to EPOCH10, which means that every 10 epochs the state 
is persisted in a centralized file on the server side.  (was: It aims to 
introduce checkpointing to guarantee that a worker can recover from a previous 
failure. In detail, once a worker is brought up, it pulls the current state of 
the model. The checkpointing strategy could be set to EPOCH10, which means 
that every 10 epochs the state is persisted in a centralized file on the 
server side.)

> Task error and preemption handles
> -
>
> Key: SYSTEMML-2421
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2421
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to introduce checkpointing to guarantee that a worker can recover 
> from a previous failure. In detail, once a worker is brought up, it pulls 
> the current state of the model, which includes each worker's progress (i.e., 
> which batch iteration and epoch is being executed). The checkpointing 
> strategy could be set to EPOCH10, which means that every 10 epochs the state 
> is persisted in a centralized file on the server side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
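The recovery protocol described above (the server persists the model plus each
worker's progress every N epochs; a restarted worker pulls that state and
resumes from it) can be sketched as follows. This is a minimal Python
illustration of the bookkeeping only, with hypothetical names; SystemML's
actual checkpoint format and storage are not shown.

```python
CHECKPOINT_EVERY = 10  # EPOCH10: persist state every 10 epochs

class ParamServer:
    def __init__(self, model, num_workers):
        self.model = model
        # Per-worker progress: (epoch, batch) currently being executed.
        self.progress = {w: (0, 0) for w in range(num_workers)}
        self.checkpoint = None  # stands in for a centralized server-side file

    def report(self, worker, epoch, batch, model):
        self.model = model
        self.progress[worker] = (epoch, batch)
        if batch == 0 and epoch > 0 and epoch % CHECKPOINT_EVERY == 0:
            # Persist model + progress; a real impl would write a file here.
            self.checkpoint = (self.model, dict(self.progress))

    def restore(self, worker):
        # A restarted worker pulls the last persisted state and resumes there.
        model, progress = self.checkpoint
        return model, progress[worker]

ps = ParamServer(model=0.0, num_workers=2)
for epoch in range(12):
    for batch in range(3):
        ps.report(worker=0, epoch=epoch, batch=batch, model=epoch + batch / 10)
print(ps.restore(0))  # state as of the epoch-10 checkpoint
```

After a failure at epoch 11, `restore` hands back the epoch-10 snapshot, so
the worker re-executes at most the epochs since the last checkpoint instead of
starting from scratch.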


[jira] [Updated] (SYSTEMML-2421) Task error and preemption handles

2018-06-26 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2421:

Description: It aims to introduce checkpointing to guarantee that a worker can 
recover from a previous failure. In detail, once a worker is brought up, it 
pulls the current state of the model. The checkpointing strategy could be set 
to EPOCH10, which means that every 10 epochs the state is persisted in a 
centralized file on the server side.  (was: It aims to introduce checkpointing 
to guarantee that a worker can recover from a previous failure. In detail, 
once a worker is brought up, it pulls the current state of the model. The 
checkpointing strategy could be set to EPOCH10, which means that every 10 
epochs the state is persisted in a file on the worker side.)

> Task error and preemption handles
> -
>
> Key: SYSTEMML-2421
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2421
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to introduce checkpointing to guarantee that a worker can recover 
> from a previous failure. In detail, once a worker is brought up, it pulls 
> the current state of the model. The checkpointing strategy could be set to 
> EPOCH10, which means that every 10 epochs the state is persisted in a 
> centralized file on the server side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2421) Task error and preemption handles

2018-06-26 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2421:

Description: It aims to introduce checkpointing to guarantee that a worker can 
recover from a previous failure. In detail, once a worker is brought up, it 
pulls the current state of the model. The checkpointing strategy could be set 
to EPOCH10, which means that every 10 epochs the state is persisted in a file 
on the worker side.  (was: It aims to introduce checkpointing to guarantee 
that a task can recover from failure. In detail, once a worker is brought up, 
it pulls the current state of the model. The checkpointing strategy could be 
set to EPOCH10, which means that every 10 epochs the state is persisted in a 
file.)

> Task error and preemption handles
> -
>
> Key: SYSTEMML-2421
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2421
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to introduce checkpointing to guarantee that a worker can recover 
> from a previous failure. In detail, once a worker is brought up, it pulls 
> the current state of the model. The checkpointing strategy could be set to 
> EPOCH10, which means that every 10 epochs the state is persisted in a file 
> on the worker side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SYSTEMML-2420) Communication between ps and workers

2018-06-25 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522920#comment-16522920
 ] 

LI Guobao commented on SYSTEMML-2420:
-

[~mboehm7], thanks for the feedback. I decided to take the latter option, 
i.e., implementing our own RPC communication. I have uploaded some diagrams 
and updated the description. I'm looking forward to your review.

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Attachments: systemml_rpc_2_seq_diagram.png, 
> systemml_rpc_class_diagram.png, systemml_rpc_sequence_diagram.png
>
>
> It aims to implement the parameter exchange between the ps and the workers. 
> We could leverage the netty framework to implement our own RPC framework. In 
> general, the netty {{TransportClient}} and {{TransportServer}} provide the 
> sending and receiving services for the ps and the workers. Extending 
> {{RpcHandler}} allows invoking the corresponding ps method (i.e., the 
> push/pull method) by handling the different input RPC call objects. The 
> {{SparkPsProxy}} wrapping a {{TransportClient}} then allows the workers to 
> issue push/pull calls to the server. At the same time, the ps netty server 
> also provides a file repository service that allows the workers to download 
> their partitioned training data, so that each worker can rebuild the matrix 
> object from the transferred file instead of Spark broadcasting all the 
> files, not all of which are needed by each worker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers

2018-06-25 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2420:

Attachment: systemml_rpc_class_diagram.png

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Attachments: systemml_rpc_2_seq_diagram.png, 
> systemml_rpc_class_diagram.png, systemml_rpc_sequence_diagram.png
>
>
> It aims to implement the parameter exchange between the ps and the workers. 
> We could leverage the netty framework to implement our own RPC framework. In 
> general, the netty {{TransportClient}} and {{TransportServer}} provide the 
> sending and receiving services for the ps and the workers. Extending 
> {{RpcHandler}} allows invoking the corresponding ps method (i.e., the 
> push/pull method) by handling the different input RPC call objects. The 
> {{SparkPsProxy}} wrapping a {{TransportClient}} then allows the workers to 
> issue push/pull calls to the server. At the same time, the ps netty server 
> also provides a file repository service that allows the workers to download 
> their partitioned training data, so that each worker can rebuild the matrix 
> object from the transferred file instead of Spark broadcasting all the 
> files, not all of which are needed by each worker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers

2018-06-25 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2420:

Description: It aims to implement the parameter exchange between the ps and 
the workers. We could leverage the netty framework to implement our own RPC 
framework. In general, the netty {{TransportClient}} and {{TransportServer}} 
provide the sending and receiving services for the ps and the workers. 
Extending {{RpcHandler}} allows invoking the corresponding ps method (i.e., 
the push/pull method) by handling the different input RPC call objects. The 
{{SparkPsProxy}} wrapping a {{TransportClient}} then allows the workers to 
issue push/pull calls to the server. At the same time, the ps netty server 
also provides a file repository service that allows the workers to download 
their partitioned training data, so that each worker can rebuild the matrix 
object from the transferred file instead of Spark broadcasting all the files, 
not all of which are needed by each worker.  (was: It aims to implement the 
parameter exchange between the ps and the workers. We could leverage Spark 
RPC to set up a ps endpoint in the driver node, which means that the ps 
service can be discovered by the workers over the network. The workers can 
then invoke the pull/push methods via RPC using the registered endpoint of 
the ps service. Hence, in detail, this task consists of registering the ps 
endpoint in the Spark RPC framework and using RPC to invoke the target method 
on the worker side. Since Spark RPC is implemented in Scala, we need to wrap 
it in order to use it from Java. Overall, we could register the ps service 
with _RpcEndpoint_ and invoke the service with _RpcEndpointRef_.)

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Attachments: systemml_rpc_2_seq_diagram.png, 
> systemml_rpc_sequence_diagram.png
>
>
> It aims to implement the parameter exchange between the ps and the workers. 
> We could leverage the netty framework to implement our own RPC framework. In 
> general, the netty {{TransportClient}} and {{TransportServer}} provide the 
> sending and receiving services for the ps and the workers. Extending 
> {{RpcHandler}} allows invoking the corresponding ps method (i.e., the 
> push/pull method) by handling the different input RPC call objects. The 
> {{SparkPsProxy}} wrapping a {{TransportClient}} then allows the workers to 
> issue push/pull calls to the server. At the same time, the ps netty server 
> also provides a file repository service that allows the workers to download 
> their partitioned training data, so that each worker can rebuild the matrix 
> object from the transferred file instead of Spark broadcasting all the 
> files, not all of which are needed by each worker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
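The exchange that the proxy performs over netty boils down to two RPC verbs on
the server: push (apply a gradient) and pull (fetch the current model). As a
minimal in-process Python sketch of those semantics only (threads and a lock
stand in for the netty transport and handler threads; all class and method
names are hypothetical, not SystemML's):

```python
import threading

class ParamServer:
    # Holds the shared model; push applies a gradient, pull returns the model.
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()  # serialize concurrent handler threads

    def push(self, worker_id, gradient):
        with self._lock:
            self._model = self._model + gradient

    def pull(self, worker_id):
        with self._lock:
            return self._model

class PsProxy:
    # Worker-side stub; a real proxy would serialize these calls through a
    # network transport instead of invoking the server object directly.
    def __init__(self, server, worker_id):
        self._server, self._id = server, worker_id

    def push(self, gradient):
        self._server.push(self._id, gradient)

    def pull(self):
        return self._server.pull(self._id)

ps = ParamServer(model=0.0)

def work(wid):
    proxy = PsProxy(ps, wid)
    for _ in range(100):
        proxy.push(0.01)  # stand-in for a computed gradient

threads = [threading.Thread(target=work, args=(w,)) for w in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(round(PsProxy(ps, 0).pull(), 2))  # 4 workers x 100 pushes x 0.01
```

The lock is what a BSP/ASP server needs anyway to keep concurrent pushes
consistent; the file repository service mentioned above would sit next to this
as a second, independent request type on the same server.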


[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers

2018-06-25 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2420:

Attachment: systemml_rpc_sequence_diagram.png

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Attachments: systemml_rpc_2_seq_diagram.png, 
> systemml_rpc_sequence_diagram.png
>
>
> It aims to implement the parameter exchange between the ps and the workers. 
> We could leverage Spark RPC to set up a ps endpoint in the driver node, 
> which means that the ps service can be discovered by the workers over the 
> network. The workers can then invoke the pull/push methods via RPC using the 
> registered endpoint of the ps service. Hence, in detail, this task consists 
> of registering the ps endpoint in the Spark RPC framework and using RPC to 
> invoke the target method on the worker side. Since Spark RPC is implemented 
> in Scala, we need to wrap it in order to use it from Java. Overall, we could 
> register the ps service with _RpcEndpoint_ and invoke the service with 
> _RpcEndpointRef_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers

2018-06-25 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2420:

Attachment: systemml_rpc_2_seq_diagram.png

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
> Attachments: systemml_rpc_2_seq_diagram.png, 
> systemml_rpc_sequence_diagram.png
>
>
> It aims to implement the parameter exchange between ps and workers. We could 
> leverage Spark RPC to set up a ps endpoint on the driver node, which means 
> that the ps service could be discovered by workers in the network. The 
> workers could then invoke the pull/push methods via RPC using the registered 
> endpoint of the ps service. Hence, in detail, this task consists of 
> registering the ps endpoint in the Spark RPC framework and using RPC to 
> invoke the target methods on the worker side. Since Spark RPC is implemented 
> in Scala, we need to wrap it in order to use it from Java. Overall, we could 
> register the ps service with _RpcEndpoint_ and invoke the service with 
> _RpcEndpointRef_.





[jira] [Commented] (SYSTEMML-2420) Communication between ps and workers

2018-06-24 Thread LI Guobao (JIRA)


[ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521704#comment-16521704
 ] 

LI Guobao commented on SYSTEMML-2420:
-

[~mboehm7], I wrote down the idea for implementing the exchange between ps and 
workers. Is this idea correct?

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to implement the parameter exchange between ps and workers. We could 
> leverage Spark RPC to set up a ps endpoint on the driver node, which means 
> that the ps service could be discovered by workers in the network. The 
> workers could then invoke the pull/push methods via RPC using the registered 
> endpoint of the ps service. Hence, in detail, this task consists of 
> registering the ps endpoint in the Spark RPC framework and using RPC to 
> invoke the target methods on the worker side. Since Spark RPC is implemented 
> in Scala, we need to wrap it in order to use it from Java. Overall, we could 
> register the ps service with _RpcEndpoint_ and invoke the service with 
> _RpcEndpointRef_.





[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers

2018-06-24 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2420:

Description: It aims to implement the parameter exchange between ps and 
workers. We could leverage Spark RPC to set up a ps endpoint on the driver 
node, which means that the ps service could be discovered by workers in the 
network. The workers could then invoke the pull/push methods via RPC using the 
registered endpoint of the ps service. Hence, in detail, this task consists of 
registering the ps endpoint in the Spark RPC framework and using RPC to invoke 
the target methods on the worker side. Since Spark RPC is implemented in 
Scala, we need to wrap it in order to use it from Java. Overall, we could 
register the ps service with _RpcEndpoint_ and invoke the service with 
_RpcEndpointRef_.  (was: It aims to implement the parameter exchange between 
ps and workers. We could leverage Spark RPC to set up a ps endpoint on the 
driver node, which means that the ps service could be discovered by workers in 
the network. The workers could then invoke the pull/push methods via RPC using 
the registered endpoint of the ps service. Hence, in detail, this task 
consists of registering the ps endpoint in the Spark RPC framework and using 
RPC to invoke the target methods on the worker side. Since Spark RPC is 
implemented in Scala, we could easily wrap it in a Java class for reuse. The 
two methods (push, pull) could then be wrapped into this defined endpoint.)

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to implement the parameter exchange between ps and workers. We could 
> leverage Spark RPC to set up a ps endpoint on the driver node, which means 
> that the ps service could be discovered by workers in the network. The 
> workers could then invoke the pull/push methods via RPC using the registered 
> endpoint of the ps service. Hence, in detail, this task consists of 
> registering the ps endpoint in the Spark RPC framework and using RPC to 
> invoke the target methods on the worker side. Since Spark RPC is implemented 
> in Scala, we need to wrap it in order to use it from Java. Overall, we could 
> register the ps service with _RpcEndpoint_ and invoke the service with 
> _RpcEndpointRef_.





[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers

2018-06-24 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2420:

Description: It aims to implement the parameter exchange between ps and 
workers. We could leverage Spark RPC to set up a ps endpoint on the driver 
node, which means that the ps service could be discovered by workers in the 
network. The workers could then invoke the pull/push methods via RPC using the 
registered endpoint of the ps service. Hence, in detail, this task consists of 
registering the ps endpoint in the Spark RPC framework and using RPC to invoke 
the target methods on the worker side. Since Spark RPC is implemented in 
Scala, we could easily wrap it in a Java class for reuse. The two methods 
(push, pull) could then be wrapped into this defined endpoint.  (was: It aims 
to implement the parameter exchange between ps and workers. We could leverage 
Spark RPC to set up a ps endpoint on the driver node, which means that the ps 
service could be discovered by workers in the network. The workers could then 
invoke the pull/push methods via RPC using the registered endpoint of the ps 
service. Hence, in detail, this task consists of registering the ps endpoint 
in the Spark RPC framework and using RPC to invoke the target methods on the 
worker side. Since Spark RPC is implemented in Scala, we could easily wrap it 
in a Java class for reuse.)

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to implement the parameter exchange between ps and workers. We could 
> leverage Spark RPC to set up a ps endpoint on the driver node, which means 
> that the ps service could be discovered by workers in the network. The 
> workers could then invoke the pull/push methods via RPC using the registered 
> endpoint of the ps service. Hence, in detail, this task consists of 
> registering the ps endpoint in the Spark RPC framework and using RPC to 
> invoke the target methods on the worker side. Since Spark RPC is implemented 
> in Scala, we could easily wrap it in a Java class for reuse. The two methods 
> (push, pull) could then be wrapped into this defined endpoint.





[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers

2018-06-24 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2420:

Description: It aims to implement the parameter exchange between ps and 
workers. We could leverage Spark RPC to set up a ps endpoint on the driver 
node, which means that the ps service could be discovered by workers in the 
network. The workers could then invoke the pull/push methods via RPC using the 
registered endpoint of the ps service. Hence, in detail, this task consists of 
registering the ps endpoint in the Spark RPC framework and using RPC to invoke 
the target methods on the worker side. Since Spark RPC is implemented in 
Scala, we could easily wrap it in a Java class for reuse.  (was: It aims to 
implement the parameter exchange between ps and workers. We could leverage 
Spark RPC to set up a ps endpoint on the driver node, which means that the ps 
service could be discovered by workers in the network. The workers could then 
invoke the pull/push methods via RPC using the registered endpoint of the ps 
service. Hence, in detail, this task consists of registering the ps endpoint 
in the Spark RPC framework and using RPC to invoke the target methods on the 
worker side.)

> Communication between ps and workers
> 
>
> Key: SYSTEMML-2420
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to implement the parameter exchange between ps and workers. We could 
> leverage Spark RPC to set up a ps endpoint on the driver node, which means 
> that the ps service could be discovered by workers in the network. The 
> workers could then invoke the pull/push methods via RPC using the registered 
> endpoint of the ps service. Hence, in detail, this task consists of 
> registering the ps endpoint in the Spark RPC framework and using RPC to 
> invoke the target methods on the worker side. Since Spark RPC is implemented 
> in Scala, we could easily wrap it in a Java class for reuse.





[jira] [Updated] (SYSTEMML-2424) Determine the level of par

2018-06-24 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2424:

Description: It aims to determine the parallelism level according to the 
cluster resources, i.e., the total number of vcores.  (was: It aims to 
determine the parallelism level according to the cluster resources, i.e., the 
number of vcores per executor.)

> Determine the level of par
> --
>
> Key: SYSTEMML-2424
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2424
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> It aims to determine the parallelism level according to the cluster 
> resources, i.e., the total number of vcores.
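Deriving the parallelism level from the total vcore count can be sketched as follows. This is a minimal illustration of the arithmetic only; the method and parameter names are hypothetical, not SystemML's actual code.

```java
// Hypothetical sketch: compute the ps degree of parallelism from the
// cluster resources, i.e., the total number of vcores.
public class ParLevel {
    // Total vcores available across the cluster.
    static int totalVcores(int numExecutors, int coresPerExecutor) {
        return numExecutors * coresPerExecutor;
    }

    // The number of parallel workers is capped by the total vcores.
    static int degreeOfParallelism(int requestedWorkers, int numExecutors, int coresPerExecutor) {
        return Math.min(requestedWorkers, totalVcores(numExecutors, coresPerExecutor));
    }

    public static void main(String[] args) {
        // 4 executors x 8 cores = 32 vcores, so 64 requested workers are capped at 32.
        System.out.println(degreeOfParallelism(64, 4, 8)); // 32
    }
}
```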





[jira] [Created] (SYSTEMML-2424) Determine the level of par

2018-06-24 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2424:
---

 Summary: Determine the level of par
 Key: SYSTEMML-2424
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2424
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao


It aims to determine the parallelism level according to the cluster resources, 
i.e., the number of vcores per executor.





[jira] [Created] (SYSTEMML-2423) Implementation of spark ps

2018-06-24 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2423:
---

 Summary: Implementation of spark ps
 Key: SYSTEMML-2423
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2423
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao








[jira] [Created] (SYSTEMML-2422) Implementation of remote worker

2018-06-24 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2422:
---

 Summary: Implementation of remote worker
 Key: SYSTEMML-2422
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2422
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao








[jira] [Updated] (SYSTEMML-2418) Distributing data to workers

2018-06-24 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2418:

Description: In the context of the ps, the training data will be partitioned 
according to the different schemes. This partitioning is executed on the 
driver node, and the partitioned data should be distributed to the workers via 
broadcast. Due to the 2 GB limitation of a Spark broadcast, we could leverage 
the _PartitionedBroadcast_ class for this purpose. Afterwards, the partitioned 
broadcast object can be passed to the workers to launch their jobs.

> Distributing data to workers
> 
>
> Key: SYSTEMML-2418
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>
> In the context of the ps, the training data will be partitioned according to 
> the different schemes. This partitioning is executed on the driver node, and 
> the partitioned data should be distributed to the workers via broadcast. Due 
> to the 2 GB limitation of a Spark broadcast, we could leverage the 
> _PartitionedBroadcast_ class for this purpose. Afterwards, the partitioned 
> broadcast object can be passed to the workers to launch their jobs.
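One of the partitioning schemes referenced in this epic, disjoint-contiguous, can be sketched with plain index arithmetic: each of the k workers receives one contiguous, non-overlapping block of rows. This only illustrates the index math; the actual partitioner and the _PartitionedBroadcast_ handoff in SystemML may differ, and all names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative disjoint-contiguous partitioner: splits numRows into k
// contiguous, non-overlapping row ranges, spreading any remainder over
// the first workers.
public class DisjointContiguous {
    static List<int[]> partition(int numRows, int k) {
        List<int[]> ranges = new ArrayList<>(); // each entry: {beginRow, endRowExclusive}
        int base = numRows / k, rest = numRows % k, begin = 0;
        for (int i = 0; i < k; i++) {
            int size = base + (i < rest ? 1 : 0);
            ranges.add(new int[]{begin, begin + size});
            begin += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : partition(10, 3))
            System.out.println(r[0] + ".." + r[1]); // 0..4, 4..7, 7..10
    }
}
```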





[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend

2018-06-24 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2087:

Description: This part aims to implement the parameter server for the Spark 
distributed backend.  (was: This part aims to implement the parameter server 
for the Spark distributed backend. In general, the implementation of the ps is 
very close to the local ps. The ps provides the pull/push service to the 
workers on the driver node, whereas the communication between the ps and the 
workers is done via RPC. The data then needs to be distributed to the workers 
according to the different data partitioning schemes. The worker setup and 
cleanup differ from the local case and need to be handled.)

> Initial version of distributed spark backend
> 
>
> Key: SYSTEMML-2087
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: Matthias Boehm
>Assignee: LI Guobao
>Priority: Major
>
> This part aims to implement the parameter server for the Spark distributed 
> backend.





[jira] [Created] (SYSTEMML-2421) Task error and preemption handles

2018-06-24 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2421:
---

 Summary: Task error and preemption handles
 Key: SYSTEMML-2421
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2421
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao


It aims to introduce checkpointing to guarantee that a task can recover from 
failure. In detail, once a worker is brought up, it pulls the current state of 
the model. The checkpointing frequency could be set to, e.g., EPOCH10, which 
means that the state is persisted to a file every 10 epochs.





[jira] [Created] (SYSTEMML-2420) Communication between ps and workers

2018-06-24 Thread LI Guobao (JIRA)
LI Guobao created SYSTEMML-2420:
---

 Summary: Communication between ps and workers
 Key: SYSTEMML-2420
 URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
 Project: SystemML
  Issue Type: Sub-task
Reporter: LI Guobao
Assignee: LI Guobao


It aims to implement the parameter exchange between ps and workers. We could 
leverage Spark RPC to set up a ps endpoint on the driver node, which means 
that the ps service could be discovered by workers in the network. The workers 
could then invoke the pull/push methods via RPC using the registered endpoint 
of the ps service. Hence, in detail, this task consists of registering the ps 
endpoint in the Spark RPC framework and using RPC to invoke the target methods 
on the worker side.





[jira] [Updated] (SYSTEMML-2419) Setup and cleanup of remote workers

2018-06-24 Thread LI Guobao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2419:

Summary: Setup and cleanup of remote workers  (was: Setup of remote workers)

> Setup and cleanup of remote workers
> ---
>
> Key: SYSTEMML-2419
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2419
> Project: SystemML
>  Issue Type: Sub-task
>Reporter: LI Guobao
>Assignee: LI Guobao
>Priority: Major
>





