[jira] [Resolved] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing
[ https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2446.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Paramserv adagrad ASP batch disjoint_continuous failing
> --------------------------------------------------------
>
>                 Key: SYSTEMML-2446
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2446
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> {code}
> Caused by: java.io.IOException: File scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on HDFS/LFS.
>     at org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120)
>     at org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51)
>     at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197)
>     at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164)
>     at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434)
>     at org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59)
>     at org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886)
>     at org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434)
> {code}
[jira] [Resolved] (SYSTEMML-2304) Submit final product
[ https://issues.apache.org/jira/browse/SYSTEMML-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2304.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Submit final product
> --------------------
>
>                 Key: SYSTEMML-2304
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2304
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
[jira] [Closed] (SYSTEMML-2302) Second version of execution backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao closed SYSTEMML-2302.
-------------------------------
    Resolution: Invalid
    Fix Version/s: SystemML 1.2

> Second version of execution backend
> -----------------------------------
>
>                 Key: SYSTEMML-2302
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2302
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> This part aims to complement the updating strategies by adding ASP and SSP.
[jira] [Resolved] (SYSTEMML-2458) Add experiment on spark paramserv
[ https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2458.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Add experiment on spark paramserv
> ---------------------------------
>
>                 Key: SYSTEMML-2458
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
[jira] [Resolved] (SYSTEMML-2090) Documentation of language extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2090.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Documentation of language extension
> -----------------------------------
>
>                 Key: SYSTEMML-2090
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2090
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
[jira] [Updated] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2299:
--------------------------------
    Description:

The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be:

{code:java}
model' = paramserv(model=paramsList, features=X, labels=Y, val_features=X_val,
  val_labels=Y_val, upd="fun1", agg="fun2", mode="LOCAL", utype="BSP",
  freq="BATCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous",
  hyperparams=params, checkpointing="NONE")
{code}

We are interested in providing the model (a struct-like data structure consisting of the weights, the biases, and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., per epoch or per mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partitioning scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function returns a trained model in struct format.

*Inputs*:
* model: a list consisting of the weight and bias matrices
* features: training features matrix
* labels: training label matrix
* val_features [optional]: validation features matrix
* val_labels [optional]: validation label matrix
* upd: the name of the gradient calculation function
* agg: the name of the gradient aggregation function
* mode (options: LOCAL, REMOTE_SPARK): the execution backend on which the parameter server is executed
* utype (options: BSP, ASP, SSP): the update mode
* freq [optional] (default: BATCH) (options: EPOCH, BATCH): the frequency of updates
* epochs: the number of epochs
* batchsize [optional] (default: 64): the batch size; ignored if the update frequency is "EPOCH"
* k [optional] (default: the number of vcores, or vcores / 2 if using OpenBLAS): the degree of parallelism
* scheme [optional] (default: disjoint_contiguous) (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partitioning scheme, i.e., how the data is distributed across the workers
* hyperparams [optional]: a list consisting of the additional hyperparameters, e.g., learning rate, momentum
* checkpointing [optional] (default: NONE) (options: NONE, EPOCH, EPOCH10): the checkpointing strategy; a checkpoint can be set for every epoch or every 10 epochs

*Output*:
* model': a list consisting of the updated weight and bias matrices


  was:

The objective of the “paramserv” built-in function is to update an initial or existing model with configuration. An initial function signature would be:

{code:java}
model' = paramserv(model=paramsList, features=X, labels=Y, val_features=X_val,
  val_labels=Y_val, upd="fun1", agg="fun2", mode="LOCAL", utype="BSP",
  freq="BATCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous",
  hyperparams=params, checkpointing="NONE")
{code}

We are interested in providing the model (a struct-like data structure consisting of the weights, the biases, and the hyperparameters), the training features and labels, the validation features and labels, the batch update function (i.e., the gradient calculation function), the update strategy (e.g., sync, async, hogwild!, stale-synchronous), the update frequency (e.g., per epoch or per mini-batch), the gradient aggregation function, the number of epochs, the batch size, the degree of parallelism, the data partitioning scheme, a list of additional hyperparameters, as well as the checkpointing strategy. The function returns a trained model in struct format.

*Inputs*:
* model: a list consisting of the weight and bias matrices
* features: training features matrix
* labels: training label matrix
* val_features: validation features matrix
* val_labels: validation label matrix
* upd: the name of the gradient calculation function
* agg: the name of the gradient aggregation function
* mode (options: LOCAL, REMOTE_SPARK): the execution backend on which the parameter server is executed
* utype (options: BSP, ASP, SSP): the update mode
* freq [optional] (default: BATCH) (options: EPOCH, BATCH): the frequency of updates
* epochs: the number of epochs
* batchsize [optional] (default: 64): the batch size; ignored if the update frequency is "EPOCH"
* k [optional] (default: the number of vcores, or vcores / 2 if using OpenBLAS): the degree of parallelism
* scheme [optional] (default: disjoint_contiguous) (options: disjoint_contiguous, disjoint_round_robin, disjoint_random, overlap_reshuffle): the data partitioning scheme, i.e., how the data is distributed across the workers
* hyperparams [optional]: a list consisting
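For illustration, here is a minimal DML sketch of a call with the signature above. The data paths, matrix shapes, and the user-defined function names "gradients" and "aggregation" are hypothetical placeholders, not part of this ticket:

{code:java}
# hypothetical training data and a single-layer initial model
X = read("hdfs:/data/train_X")
Y = read("hdfs:/data/train_Y")
W1 = rand(rows=784, cols=10)
b1 = matrix(0, rows=1, cols=10)
paramsList = list(W1, b1)
params = list(lr=0.01, mu=0.9)

# "gradients" and "aggregation" are user-defined DML functions
# playing the roles of upd and agg described above
model = paramserv(model=paramsList, features=X, labels=Y,
  upd="gradients", agg="aggregation", mode="LOCAL", utype="BSP",
  freq="BATCH", epochs=100, batchsize=64, k=7,
  scheme="disjoint_contiguous", hyperparams=params)
{code}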
[jira] [Commented] (SYSTEMML-2458) Add experiment on spark paramserv
[ https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569424#comment-16569424 ]

LI Guobao commented on SYSTEMML-2458:
-------------------------------------

[~mboehm7], yes, I added the baseline experiment w/o paramserv and fixed the location of the SystemML-config.xml file. Additionally, I have double-checked the native BLAS configuration for the remote workers, and it is correctly transferred and set on the remote workers.

> Add experiment on spark paramserv
> ---------------------------------
>
>                 Key: SYSTEMML-2458
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
[jira] [Commented] (SYSTEMML-2458) Add experiment on spark paramserv
[ https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569312#comment-16569312 ]

LI Guobao commented on SYSTEMML-2458:
-------------------------------------

[~mboehm7], since I hope to have some experiment results for the presentation, I have pushed the latest polished scripts and the newly packaged jar with the recent patches. Could we continue launching the experiments?

> Add experiment on spark paramserv
> ---------------------------------
>
>                 Key: SYSTEMML-2458
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2458
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
[jira] [Resolved] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2482.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution. However, the matrices referenced by the list should be pinned to avoid being cleaned up. This issue is related to [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481].
[jira] [Commented] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567477#comment-16567477 ]

LI Guobao commented on SYSTEMML-2482:
-------------------------------------

I just saw your latest commit. Thanks for helping me. And yes, let's keep the current behavior.

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Commented] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567470#comment-16567470 ]

LI Guobao commented on SYSTEMML-2482:
-------------------------------------

[~mboehm7], well, sorry about the vague description. Actually, I just found that the data status in the list object is no longer used (i.e., it is null or an array of false). Before that commit, all the matrices of a list output were pinned in the variables table, and the pinned status was saved in this boolean array. I fixed the eviction problem by changing the logic of cleaning up the list object according to its data status.

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2482:
--------------------------------
    Description:

Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution. However, the matrices referenced by the list should be pinned to avoid being cleaned up. This issue is related to [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481].

  was:
Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution. This issue is related to [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481].

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2482:
--------------------------------
    Description:

Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution. This issue is related to [SYSTEMML-2481|https://issues.apache.org/jira/browse/SYSTEMML-2481].

  was:
Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution. This issue is related to [ticket|https://issues.apache.org/jira/browse/SYSTEMML-2481].

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2482:
--------------------------------
    Description:

Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution. This issue is related to [ticket|https://issues.apache.org/jira/browse/SYSTEMML-2481].

  was:
Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution.

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2482:
--------------------------------
    Description:

Some unexpected overhead occurred when running {{*testParamservASPEpochDisjointContiguous*}} in {{*org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest*}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution.

  was:
Some unexpected overhead occurred when running {{testParamservASPEpochDisjointContiguous}} in {{org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution.

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Updated] (SYSTEMML-2482) Unexpected cleanup of list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2482:
--------------------------------
    Description:

Some unexpected overhead occurred when running {{testParamservASPEpochDisjointContiguous}} in {{org.apache.sysml.test.integration.functions.paramserv.ParamservSparkNNTest}}. The test took longer when the output of an instruction is a list, which is cleaned up after execution.

> Unexpected cleanup of list object
> ---------------------------------
>
>                 Key: SYSTEMML-2482
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Created] (SYSTEMML-2482) Unexpected cleanup of list object
LI Guobao created SYSTEMML-2482:
-----------------------------------

             Summary: Unexpected cleanup of list object
                 Key: SYSTEMML-2482
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2482
             Project: SystemML
          Issue Type: Bug
            Reporter: LI Guobao
[jira] [Updated] (SYSTEMML-2478) Overhead when using parfor in update func
[ https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2478:
--------------------------------
    Description:

When using parfor inside the update function, some MR tasks are launched to write the task output, and the paramserv run takes more time than without parfor in the update function. The scenario is to launch the ASP Epoch DC spark paramserv test.

Here are the stats:

{code:java}
Total elapsed time:             101.804 sec.
Total compilation time:         3.690 sec.
Total execution time:           98.114 sec.
Number of compiled Spark inst:  302.
Number of executed Spark inst:  540.
Cache hits (Mem, WB, FS, HDFS): 57839/0/0/240.
Cache writes (WB, FS, HDFS):    14567/58/61.
Cache times (ACQr/m, RLS, EXP): 42.346/0.064/4.761/20.280 sec.
HOP DAGs recompiled (PRED, SB): 0/144.
HOP DAGs recompile time:        0.507 sec.
Functions recompiled:           16.
Functions recompile time:       0.064 sec.
Spark ctx create time (lazy):   1.376 sec.
Spark trans counts (par,bc,col):270/1/240.
Spark trans times (par,bc,col): 0.573/0.197/42.255 secs.
Paramserv total num workers:    3.
Paramserv setup time:           1.559 secs.
Paramserv grad compute time:    105.701 secs.
Paramserv model update time:    56.801/47.193 secs.
Paramserv model broadcast time: 23.872 secs.
Paramserv batch slice time:     0.000 secs.
Paramserv RPC request time:     105.159 secs.
ParFor loops optimized:         1.
ParFor optimize time:           0.040 sec.
ParFor initialize time:         0.434 sec.
ParFor result merge time:       0.005 sec.
ParFor total update in-place:   0/7/7
Total JIT compile time:         68.384 sec.
Total JVM GC count:             1120.
Total JVM GC time:              22.338 sec.
Heavy hitter instructions:
  #  Instruction              Time(s)  Count
  1  paramserv                 97.221      1
  2  conv2d_bias_add           60.581    614
  3  *                         54.990  12447
  4  sp_-                      20.625    240
  5  -                         17.979   7287
  6  +                         14.191  12824
  7  r'                         5.636   1200
  8  conv2d_backward_filter     5.123    600
  9  max                        4.985    907
 10  ba+*                       4.591   1814
{code}

Here is the polished update function:

{code:java}
aggregation = function(list[unknown] model, list[unknown] gradients, list[unknown] hyperparams)
  return (list[unknown] modelResult)
{
  lr = as.double(as.scalar(hyperparams["lr"]))
  mu = as.double(as.scalar(hyperparams["mu"]))
  modelResult = model
  # Optimize with SGD w/ Nesterov momentum
  parfor(i in 1:8, check=0) {
    P = as.matrix(model[i])
    dP = as.matrix(gradients[i])
    vP = as.matrix(model[8+i])
    [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
    modelResult[i] = P
    modelResult[8+i] = vP
  }
}
{code}

[~mboehm7], in fact, I have no idea where this comes from. It seems that the parfor task output is written to HDFS. Is this the normal behavior?

  was:

When using parfor inside the update function, some MR tasks are launched to write the task output, and the paramserv run takes more time than without parfor in the update function. The scenario is to launch the ASP Epoch DC spark paramserv test.

Here are the stats:

{code:java}
Total elapsed time:             101.804 sec.
Total compilation time:         3.690 sec.
Total execution time:           98.114 sec.
Number of compiled Spark inst:  302.
Number of executed Spark inst:  540.
Cache hits (Mem, WB, FS, HDFS): 57839/0/0/*240*.
Cache writes (WB, FS, HDFS):    14567/58/61.
Cache times (ACQr/m, RLS, EXP): 42.346/0.064/4.761/20.280 sec.
HOP DAGs recompiled (PRED, SB): 0/144.
HOP DAGs recompile time:        0.507 sec.
Functions recompiled:           16.
Functions recompile time:       0.064 sec.
Spark ctx create time (lazy):   1.376 sec.
Spark trans counts (par,bc,col):270/1/240.
Spark trans times (par,bc,col): 0.573/0.197/42.255 secs.
Paramserv total num workers:    3.
Paramserv setup time:           1.559 secs.
Paramserv grad compute time:    105.701 secs.
Paramserv model update time:    56.801/47.193 secs.
Paramserv model broadcast time: 23.872 secs.
Paramserv batch slice time:     0.000 secs.
Paramserv RPC request time:     105.159 secs.
ParFor loops optimized:         1.
ParFor optimize time:           0.040 sec.
ParFor initialize time:         0.434 sec.
ParFor result merge time:       0.005 sec.
ParFor total update in-place:   0/7/7
Total JIT compile time:         68.384 sec.
Total JVM GC count:             1120.
Total JVM GC time:              22.338 sec.
Heavy hitter instructions:
  #  Instruction              Time(s)  Count
  1  paramserv                 97.221      1
  2  conv2d_bias_add           60.581    614
  3  *                         54.990  12447
  4  sp_-                      20.625
{code}
[jira] [Updated] (SYSTEMML-2478) Overhead when using parfor in update func
[ https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2478:
--------------------------------
    Description:

When using parfor inside the update function, some MR tasks are launched to write the task output, and the paramserv run takes more time than without parfor in the update function. The scenario is to launch the ASP Epoch DC spark paramserv test.

Here are the stats:

{code:java}
Total elapsed time:             101.804 sec.
Total compilation time:         3.690 sec.
Total execution time:           98.114 sec.
Number of compiled Spark inst:  302.
Number of executed Spark inst:  540.
Cache hits (Mem, WB, FS, HDFS): 57839/0/0/*240*.
Cache writes (WB, FS, HDFS):    14567/58/61.
Cache times (ACQr/m, RLS, EXP): 42.346/0.064/4.761/20.280 sec.
HOP DAGs recompiled (PRED, SB): 0/144.
HOP DAGs recompile time:        0.507 sec.
Functions recompiled:           16.
Functions recompile time:       0.064 sec.
Spark ctx create time (lazy):   1.376 sec.
Spark trans counts (par,bc,col):270/1/240.
Spark trans times (par,bc,col): 0.573/0.197/42.255 secs.
Paramserv total num workers:    3.
Paramserv setup time:           1.559 secs.
Paramserv grad compute time:    105.701 secs.
Paramserv model update time:    56.801/47.193 secs.
Paramserv model broadcast time: 23.872 secs.
Paramserv batch slice time:     0.000 secs.
Paramserv RPC request time:     105.159 secs.
ParFor loops optimized:         1.
ParFor optimize time:           0.040 sec.
ParFor initialize time:         0.434 sec.
ParFor result merge time:       0.005 sec.
ParFor total update in-place:   0/7/7
Total JIT compile time:         68.384 sec.
Total JVM GC count:             1120.
Total JVM GC time:              22.338 sec.
Heavy hitter instructions:
  #  Instruction              Time(s)  Count
  1  paramserv                 97.221      1
  2  conv2d_bias_add           60.581    614
  3  *                         54.990  12447
  4  sp_-                      20.625    240
  5  -                         17.979   7287
  6  +                         14.191  12824
  7  r'                         5.636   1200
  8  conv2d_backward_filter     5.123    600
  9  max                        4.985    907
 10  ba+*                       4.591   1814
{code}

Here is the polished update function:

{code:java}
aggregation = function(list[unknown] model, list[unknown] gradients, list[unknown] hyperparams)
  return (list[unknown] modelResult)
{
  lr = as.double(as.scalar(hyperparams["lr"]))
  mu = as.double(as.scalar(hyperparams["mu"]))
  modelResult = model
  # Optimize with SGD w/ Nesterov momentum
  parfor(i in 1:8, check=0) {
    P = as.matrix(model[i])
    dP = as.matrix(gradients[i])
    vP = as.matrix(model[8+i])
    [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
    modelResult[i] = P
    modelResult[8+i] = vP
  }
}
{code}

[~mboehm7], in fact, I have no idea where this comes from. It seems that the parfor task output is written to HDFS. Is this the normal behavior?

  was: When using parfor inside update function, some MR tasks

> Overhead when using parfor in update func
> ------------------------------------------
>
>                 Key: SYSTEMML-2478
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Updated] (SYSTEMML-2478) Overhead when using parfor in update func
[ https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2478:
--------------------------------
    Summary: Overhead when using parfor in update func (was: Unexpected MR task when using parfor)

> Overhead when using parfor in update func
> ------------------------------------------
>
>                 Key: SYSTEMML-2478
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
>
> When using parfor inside update function, some MR tasks
[jira] [Updated] (SYSTEMML-2478) Unexpected MR task when using parfor
[ https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2478:
--------------------------------
    Description: When using parfor inside update function, some MR tasks

> Unexpected MR task when using parfor
> -------------------------------------
>
>                 Key: SYSTEMML-2478
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
>
> When using parfor inside update function, some MR tasks
[jira] [Created] (SYSTEMML-2478) Unexpected MR task when using parfor
LI Guobao created SYSTEMML-2478:
-----------------------------------

             Summary: Unexpected MR task when using parfor
                 Key: SYSTEMML-2478
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
             Project: SystemML
          Issue Type: Bug
            Reporter: LI Guobao
[jira] [Resolved] (SYSTEMML-2477) NPE when copying list object
[ https://issues.apache.org/jira/browse/SYSTEMML-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2477.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> NPE when copying list object
> ----------------------------
>
>                 Key: SYSTEMML-2477
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2477
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
[jira] [Created] (SYSTEMML-2477) NPE when copying list object
LI Guobao created SYSTEMML-2477:
-----------------------------------

             Summary: NPE when copying list object
                 Key: SYSTEMML-2477
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2477
             Project: SystemML
          Issue Type: Bug
            Reporter: LI Guobao
            Assignee: LI Guobao
[jira] [Updated] (SYSTEMML-2476) Unexpected mapreduce task
[ https://issues.apache.org/jira/browse/SYSTEMML-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao updated SYSTEMML-2476:
--------------------------------
    Description:

When trying to use scalar casting to get an element from a list, unexpected mapreduce tasks are launched instead of running in CP mode. The scenario is to replace *C = 1* with *C = as.scalar(hyperparams["C"])* inside the {{_gradient function_}} found in {{_src/test/scripts/functions/paramserv/mnist_lenet_paramserv.dml_}}. The problem can then be reproduced by launching the method {{_testParamservBSPBatchDisjointContiguous_}} inside the class _{{org.apache.sysml.test.integration.functions.paramserv.ParamservLocalNNTest}}_.

Here is the stack:

{code:java}
18/07/31 22:10:27 INFO mapred.MapTask: numReduceTasks: 1
18/07/31 22:10:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/07/31 22:10:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/07/31 22:10:27 INFO mapred.MapTask: soft limit at 83886080
18/07/31 22:10:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/07/31 22:10:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/07/31 22:10:27 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/07/31 22:10:27 INFO mapreduce.Job: Running job: job_local792652629_0008
{code}

[~mboehm7], if possible, could you take a look at this? I have double-checked the creation of the execution context in {{ParamservBuiltinCPInstruction}}; it is an instance of ExecutionContext, not SparkExecutionContext.

  was:

When trying to use scalar casting to get an element from a list, unexpected mapreduce tasks are launched instead of running in CP mode. The scenario is to replace *C = 1* with *C = as.scalar(hyperparams["C"])* inside the {{_gradient function_}} found in {{_src/test/scripts/functions/paramserv/mnist_lenet_paramserv.dml_}}. The problem can then be reproduced by launching the method {{_testParamservBSPBatchDisjointContiguous_}} inside the class _{{org.apache.sysml.test.integration.functions.paramserv.ParamservLocalNNTest}}_.

Here is the stack:

{code:java}
18/07/31 22:10:27 INFO mapred.MapTask: numReduceTasks: 1
18/07/31 22:10:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/07/31 22:10:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/07/31 22:10:27 INFO mapred.MapTask: soft limit at 83886080
18/07/31 22:10:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/07/31 22:10:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/07/31 22:10:27 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/07/31 22:10:27 INFO mapreduce.Job: Running job: job_local792652629_0008
{code}

> Unexpected mapreduce task
> -------------------------
>
>                 Key: SYSTEMML-2476
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2476
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
[jira] [Created] (SYSTEMML-2476) Unexpected mapreduce task
LI Guobao created SYSTEMML-2476:
-----------------------------------

             Summary: Unexpected mapreduce task
                 Key: SYSTEMML-2476
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2476
             Project: SystemML
          Issue Type: Bug
            Reporter: LI Guobao

When trying to use scalar casting to get an element from a list, unexpected mapreduce tasks are launched instead of running in CP mode. The scenario is to replace *C = 1* with *C = as.scalar(hyperparams["C"])* inside the {{_gradient function_}} found in {{_src/test/scripts/functions/paramserv/mnist_lenet_paramserv.dml_}}. The problem can then be reproduced by launching the method {{_testParamservBSPBatchDisjointContiguous_}} inside the class _{{org.apache.sysml.test.integration.functions.paramserv.ParamservLocalNNTest}}_.

Here is the stack:

{code:java}
18/07/31 22:10:27 INFO mapred.MapTask: numReduceTasks: 1
18/07/31 22:10:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/07/31 22:10:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/07/31 22:10:27 INFO mapred.MapTask: soft limit at 83886080
18/07/31 22:10:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/07/31 22:10:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/07/31 22:10:27 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/07/31 22:10:27 INFO mapreduce.Job: Running job: job_local792652629_0008
{code}
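To make the described change concrete, here is a minimal, hypothetical DML sketch in the spirit of the gradient function; the signature and the placeholder gradient are simplified assumptions, and only the highlighted line corresponds to the change described in the ticket:

{code:java}
gradients = function(list[unknown] model, list[unknown] hyperparams,
    matrix[double] features, matrix[double] labels)
  return (list[unknown] gradients)
{
  # before: hard-coded constant, stays in CP mode
  # C = 1
  # after: scalar casting from the hyperparams list, which
  # unexpectedly triggers the local mapreduce jobs shown above
  C = as.scalar(hyperparams["C"])

  # simplified placeholder gradient of a linear model, scaled by C
  W = as.matrix(model[1])
  dW = C * (t(features) %*% (features %*% W - labels)) / nrow(features)
  gradients = list(dW)
}
{code}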
[jira] [Resolved] (SYSTEMML-2469) Large distributed paramserv overheads
[ https://issues.apache.org/jira/browse/SYSTEMML-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2469.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Large distributed paramserv overheads
> --------------------------------------
>
>                 Key: SYSTEMML-2469
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> Initial runs with the distributed paramserv implementation on a small cluster revealed that it is working correctly while exhibiting large overheads. Below are the stats for mnist lenet, 10 epochs, ASP, update per EPOCH, on a cluster of 1+6 nodes (24 cores per worker node).
> {code}
> Total elapsed time:             687.743 sec.
> Total compilation time:         3.815 sec.
> Total execution time:           683.928 sec.
> Number of compiled Spark inst:  330.
> Number of executed Spark inst:  0.
> Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
> Cache writes (WB, FS, HDFS):    29856/5271/0.
> Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/1629.
> HOP DAGs recompile time:        4.878 sec.
> Functions recompiled:           1.
> Functions recompile time:       0.097 sec.
> Spark ctx create time (lazy):   22.222 sec.
> Spark trans counts (par,bc,col):2/1/0.
> Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
> Paramserv total num workers:    144.
> Paramserv setup time:           68.259 secs.
> Paramserv grad compute time:    6952.163 secs.
> Paramserv model update time:    2453.448/422.955 secs.
> Paramserv model broadcast time: 24.982 secs.
> Paramserv batch slice time:     0.204 secs.
> Paramserv RPC request time:     51611.210 secs.
> ParFor loops optimized:         1.
> ParFor optimize time:           0.462 sec.
> ParFor initialize time:         0.049 sec.
> ParFor result merge time:       0.028 sec.
> ParFor total update in-place:   0/188/188
> Total JIT compile time:         98.786 sec.
> Total JVM GC count:             68.
> Total JVM GC time:              25.858 sec.
> Heavy hitter instructions:
>   #  Instruction          Time(s)  Count
>   1  paramserv            665.479      1
>   2  +                    182.410  18636
>   3  conv2d_bias_add      150.938    376
>   4  sqrt                  69.768  11528
>   5  /                     54.836  11732
>   6  ba+*                  45.901    376
>   7  *                     38.046  11727
>   8  -                     37.428  12096
>   9  ^2                    35.533   6344
>  10  exp                   21.022    188
> {code}
> There seem to be three distinct issues:
> * Too large a number of tasks when assembling the distributed input data (in the number of rows, i.e., >50,000 tasks), which makes the distributed data partitioning very slow (multiple minutes).
> * Evictions from the buffer pool at the driver node (see the cache writes). This is likely due to disabled cleanup (and missing explicit cleanup) of all RPC objects.
> * Large RPC overhead: this might be due to the evictions happening in the critical path and all 144 workers waiting with their RPC requests. In addition, we should double-check that the number of RPC handler threads is correct, whether we can get the serialization and communication out of the critical (i.e., synchronized) path of model updates, and address unnecessary serialization/deserialization overheads.
> [~Guobao] I'll help reduce the serialization/deserialization overheads, but it would be great if you could have a look into the other issues.
[jira] [Resolved] (SYSTEMML-2471) Add java doc
[ https://issues.apache.org/jira/browse/SYSTEMML-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2471.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Add java doc
> ------------
>
>                 Key: SYSTEMML-2471
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2471
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
[jira] [Closed] (SYSTEMML-2424) Determine the level of par
[ https://issues.apache.org/jira/browse/SYSTEMML-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao closed SYSTEMML-2424.
-------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Determine the level of par
> --------------------------
>
>                 Key: SYSTEMML-2424
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2424
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> It aims to determine the parallelism level according to the cluster resources, i.e., the total number of vcores.
[jira] [Created] (SYSTEMML-2471) Add java doc
LI Guobao created SYSTEMML-2471:
-----------------------------------

             Summary: Add java doc
                 Key: SYSTEMML-2471
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2471
             Project: SystemML
          Issue Type: Sub-task
            Reporter: LI Guobao
            Assignee: LI Guobao
[jira] [Resolved] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2087.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Initial version of distributed spark backend
> ---------------------------------------------
>
>                 Key: SYSTEMML-2087
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> This part aims to implement the parameter server for the Spark distributed backend.
[jira] [Resolved] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2420.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Communication between ps and workers
> -------------------------------------
>
>                 Key: SYSTEMML-2420
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2420
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
>         Attachments: systemml_rpc_2_seq_diagram.png, systemml_rpc_class_diagram.png, systemml_rpc_sequence_diagram.png
>
> It aims to implement the parameter exchange between the ps and the workers. We could leverage the Netty framework to implement our own RPC framework. In general, the Netty {{TransportClient}} and {{TransportServer}} provide the sending and receiving services for the ps and the workers. Extending the {{RpcHandler}} allows invoking the corresponding ps method (i.e., the push/pull method) by handling the different incoming RPC call objects. The {{SparkPsProxy}}, which wraps a {{TransportClient}}, then allows the workers to issue push/pull calls to the server. At the same time, the ps Netty server also provides a file repository service that allows the workers to download their partitions of the training data, so that they can rebuild the matrix objects from the transferred files instead of broadcasting with Spark all the files, which are not all necessary for each worker.
[jira] [Assigned] (SYSTEMML-2469) Large distributed paramserv overheads
[ https://issues.apache.org/jira/browse/SYSTEMML-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao reassigned SYSTEMML-2469:
-----------------------------------
    Assignee: LI Guobao

> Large distributed paramserv overheads
> --------------------------------------
>
>                 Key: SYSTEMML-2469
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
[jira] [Resolved] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1
[ https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao resolved SYSTEMML-2466.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Distributed paramserv fails on newer Spark version > 2.1
> ---------------------------------------------------------
>
>                 Key: SYSTEMML-2466
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2466
>             Project: SystemML
>          Issue Type: Task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/network/util/SystemPropertyConfigProvider
>     at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.runOnSpark(ParamservBuiltinCPInstruction.java:163)
>     at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:113)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
>     at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
>     at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:106)
>     at org.apache.sysml.api.DMLScript.execute(DMLScript.java:487)
>     at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:272)
>     at org.apache.sysml.api.DMLScript.main(DMLScript.java:195)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.network.util.SystemPropertyConfigProvider
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> {code}
[jira] [Comment Edited] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1
[ https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558420#comment-16558420 ]

LI Guobao edited comment on SYSTEMML-2466 at 7/26/18 3:23 PM:
--------------------------------------------------------------

Hi [~mboehm7], I pushed the modification for this issue in my latest PR. Is that appropriate for you? Or do I need to push it in another separate PR, without the commits of the deep serialization, in order to test the performance difference? Just let me know.

was (Author: guobao):
Hi [~mboehm7], I pushed the modification for this issue in my latest PR. Is that appropriate for you? Or do I need to push it in another separate PR, without the commits of the deep serialization?

> Distributed paramserv fails on newer Spark version > 2.1
> ---------------------------------------------------------
>
>                 Key: SYSTEMML-2466
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2466
>             Project: SystemML
>          Issue Type: Task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
[jira] [Commented] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1
[ https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558420#comment-16558420 ]

LI Guobao commented on SYSTEMML-2466:
-------------------------------------

Hi [~mboehm7], I pushed the modification for this issue in my latest PR. Is that appropriate for you? Or do I need to push it in another separate PR, without the commits of the deep serialization?

> Distributed paramserv fails on newer Spark version > 2.1
> ---------------------------------------------------------
>
>                 Key: SYSTEMML-2466
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2466
>             Project: SystemML
>          Issue Type: Task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
[jira] [Assigned] (SYSTEMML-2466) Distributed paramserv fails on newer Spark version > 2.1
[ https://issues.apache.org/jira/browse/SYSTEMML-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2466: --- Assignee: LI Guobao > Distributed paramserv fails on newer Spark version > 2.1 > > > Key: SYSTEMML-2466 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2466 > Project: SystemML > Issue Type: Task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/spark/network/util/SystemPropertyConfigProvider > at > org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.runOnSpark(ParamservBuiltinCPInstruction.java:163) > at > org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:113) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161) > at > org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116) > at > org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:106) > at org.apache.sysml.api.DMLScript.execute(DMLScript.java:487) > at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:272) > at org.apache.sysml.api.DMLScript.main(DMLScript.java:195) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.network.util.SystemPropertyConfigProvider > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2465) Keep data consistency for a pre-trained model
LI Guobao created SYSTEMML-2465: --- Summary: Keep data consistency for a pre-trained model Key: SYSTEMML-2465 URL: https://issues.apache.org/jira/browse/SYSTEMML-2465 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao In the distributed spark backend, passing a given pre-trained model to the paramserv function may cause data inconsistency, because the pre-trained model is cached in the driver's memory. In this case, when the paramserv function kicks off, the workers will first try to read the data from HDFS, where the dirty data of the pre-trained model has not yet been persisted. This leads to an inconsistency. So the idea is to export the dirty data to HDFS before kicking off the remote workers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
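For illustration, here is a minimal sketch of the export step described above. It assumes the SystemML 1.2 runtime API (a ListObject holding the model, and CacheableData#isDirty / CacheableData#exportData flushing a cached matrix to its HDFS file); treat it as a sketch of the idea, not the actual fix.
{code:java}
import org.apache.sysml.runtime.controlprogram.caching.CacheableData;
import org.apache.sysml.runtime.instructions.cp.Data;
import org.apache.sysml.runtime.instructions.cp.ListObject;

public class ModelExportSketch {
  /** Flush all dirty matrices of the model list to HDFS before
   *  launching the remote workers, so they read consistent data. */
  public static void exportDirtyModel(ListObject model) {
    for (Data d : model.getData()) {
      if (d instanceof CacheableData) {
        CacheableData<?> cd = (CacheableData<?>) d;
        if (cd.isDirty())   // in-memory copy differs from the HDFS file
          cd.exportData();  // persist the cached blob to HDFS
      }
    }
  }
}
{code}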
[jira] [Assigned] (SYSTEMML-2090) Documentation of language extension
[ https://issues.apache.org/jira/browse/SYSTEMML-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2090: --- Assignee: LI Guobao > Documentation of language extension > --- > > Key: SYSTEMML-2090 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2090 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2423) Implementation of spark ps
[ https://issues.apache.org/jira/browse/SYSTEMML-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2423. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Implementation of spark ps > -- > > Key: SYSTEMML-2423 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2423 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2457) Error handling and add statistics for spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2457: Summary: Error handling and add statistics for spark backend (was: Error handling and add statistic for spark backend) > Error handling and add statistics for spark backend > --- > > Key: SYSTEMML-2457 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2457 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2458) Add experiment on spark paramserv
[ https://issues.apache.org/jira/browse/SYSTEMML-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554752#comment-16554752 ] LI Guobao commented on SYSTEMML-2458: - [~mboehm7], I've pushed the scripts for the distributed spark experiments. Could you please take a look at them? > Add experiment on spark paramserv > - > > Key: SYSTEMML-2458 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2458 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (SYSTEMML-2414) Paramserv zero accuracy with Overlap_Reshuffle
[ https://issues.apache.org/jira/browse/SYSTEMML-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao closed SYSTEMML-2414. --- > Paramserv zero accuracy with Overlap_Reshuffle > -- > > Key: SYSTEMML-2414 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2414 > Project: SystemML > Issue Type: Bug >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2422) Implementation of remote worker
[ https://issues.apache.org/jira/browse/SYSTEMML-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2422. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Implementation of remote worker > --- > > Key: SYSTEMML-2422 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2422 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2458) Add experiment on spark paramserv
LI Guobao created SYSTEMML-2458: --- Summary: Add experiment on spark paramserv Key: SYSTEMML-2458 URL: https://issues.apache.org/jira/browse/SYSTEMML-2458 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2443) Add experiments varied on optimizers
[ https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2443. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Add experiments varied on optimizers > > > Key: SYSTEMML-2443 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2443 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > It aims to add the scripts for running the experiments with the local ps and to > explore the training results with the different optimizers (sgd, adagrad, > adam). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2457) Error handling and add statistic for spark backend
LI Guobao created SYSTEMML-2457: --- Summary: Error handling and add statistic for spark backend Key: SYSTEMML-2457 URL: https://issues.apache.org/jira/browse/SYSTEMML-2457 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2457) Error handling and add statistic for spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2457. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Error handling and add statistic for spark backend > -- > > Key: SYSTEMML-2457 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2457 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2419. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > In the context of the distributed spark env, we first need to ship the > necessary functions and variables to the remote workers, and then initialize > the buffer pool and register its cleanup for each remote worker. All of this > is inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing
[ https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao reassigned SYSTEMML-2446: --- Assignee: LI Guobao > Paramserv adagrad ASP batch disjoint_continuous failing > --- > > Key: SYSTEMML-2446 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2446 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > {code} > Caused by: java.io.IOException: File > scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on > HDFS/LFS. > at > org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120) > at > org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51) > at > org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197) > at > org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164) > at > org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434) > at > org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59) > at > org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886) > at > org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing
[ https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546134#comment-16546134 ] LI Guobao edited comment on SYSTEMML-2446 at 7/17/18 7:23 AM: -- [~mboehm7] sure. was (Author: guobao): [~mboehm7]sure. > Paramserv adagrad ASP batch disjoint_continuous failing > --- > > Key: SYSTEMML-2446 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2446 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > {code} > Caused by: java.io.IOException: File > scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on > HDFS/LFS. > at > org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120) > at > org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51) > at > org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197) > at > org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164) > at > org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434) > at > org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59) > at > org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886) > at > org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2446) Paramserv adagrad ASP batch disjoint_continuous failing
[ https://issues.apache.org/jira/browse/SYSTEMML-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546134#comment-16546134 ] LI Guobao commented on SYSTEMML-2446: - [~mboehm7]sure. > Paramserv adagrad ASP batch disjoint_continuous failing > --- > > Key: SYSTEMML-2446 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2446 > Project: SystemML > Issue Type: Sub-task >Reporter: Matthias Boehm >Assignee: LI Guobao >Priority: Major > > {code} > Caused by: java.io.IOException: File > scratch_space/_p152255_9.1.44.68/_t0/temp10100_7141 does not exist on > HDFS/LFS. > at > org.apache.sysml.runtime.io.MatrixReader.checkValidInputFile(MatrixReader.java:120) > at > org.apache.sysml.runtime.io.ReaderBinaryCell.readMatrixFromHDFS(ReaderBinaryCell.java:51) > at > org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:197) > at > org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:164) > at > org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:434) > at > org.apache.sysml.runtime.controlprogram.caching.MatrixObject.readBlobFromHDFS(MatrixObject.java:59) > at > org.apache.sysml.runtime.controlprogram.caching.CacheableData.readBlobFromHDFS(CacheableData.java:886) > at > org.apache.sysml.runtime.controlprogram.caching.CacheableData.acquireReadIntern(CacheableData.java:434) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2443) Add experiments varied on optimizers
[ https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544369#comment-16544369 ] LI Guobao commented on SYSTEMML-2443: - Hi [~mboehm7], currently I'm running the local ps experiments varied on optimizers, and I have already pushed the new scripts to github. In the meantime, I first ran them on my laptop with the mnist60k dataset, but it seems there is not much difference in model precision. Hence, I wonder if we should apply them to some other typical training data instead of mnist. For example, is there any dataset that is relatively more complex to train? > Add experiments varied on optimizers > > > Key: SYSTEMML-2443 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2443 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to add the scripts for running the experiments with the local ps and to > explore the training results with the different optimizers (sgd, adagrad, > adam). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2443) Add experiments varied on optimizers
[ https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2443: Summary: Add experiments varied on optimizers (was: Add experiment varied on optimizer) > Add experiments varied on optimizers > > > Key: SYSTEMML-2443 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2443 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to add the scripts of doing the experiments for local ps and to > explore the difference on the optimizers (sgd, adagrad, adam). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2443) Add experiments varied on optimizers
[ https://issues.apache.org/jira/browse/SYSTEMML-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2443: Description: It aims to add the scripts for running the experiments with the local ps and to explore the training results with the different optimizers (sgd, adagrad, adam). (was: It aims to add the scripts of doing the experiments for local ps and to explore the difference on the optimizers (sgd, adagrad, adam).) > Add experiments varied on optimizers > > > Key: SYSTEMML-2443 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2443 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to add the scripts for running the experiments with the local ps and to > explore the training results with the different optimizers (sgd, adagrad, > adam). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (SYSTEMML-2301) Second evaluation
[ https://issues.apache.org/jira/browse/SYSTEMML-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao closed SYSTEMML-2301. --- Resolution: Fixed Assignee: LI Guobao Fix Version/s: SystemML 1.2 > Second evaluation > - > > Key: SYSTEMML-2301 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2301 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2443) Add experiment varied on optimizer
LI Guobao created SYSTEMML-2443: --- Summary: Add experiment varied on optimizer Key: SYSTEMML-2443 URL: https://issues.apache.org/jira/browse/SYSTEMML-2443 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to add the scripts of doing the experiments for local ps and to explore the difference on the optimizers (sgd, adagrad, adam). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2440) Got zero when casting an element of list
LI Guobao created SYSTEMML-2440: --- Summary: Got zero when casting an element of list Key: SYSTEMML-2440 URL: https://issues.apache.org/jira/browse/SYSTEMML-2440 Project: SystemML Issue Type: Bug Reporter: LI Guobao When running paramserv experiments and trying to get an element of a list inside a function, I got a zero value. {code:java} stride = as.integer(as.scalar(hyperparams["stride"])) pad = as.integer(as.scalar(hyperparams["pad"])) lambda = as.double(as.scalar(hyperparams["lambda"])){code} {code:java} Caused by: java.lang.RuntimeException: Incorrect parameters: height=0 filter_height=0 stride=0 pad=0 at org.apache.sysml.runtime.util.DnnUtils.getP(DnnUtils.java:43) at org.apache.sysml.runtime.instructions.cp.DnnCPInstruction.processInstruction(DnnCPInstruction.java:457) at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252) ... 12 more {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2414) Paramserv zero accuracy with Overlap_Reshuffle
[ https://issues.apache.org/jira/browse/SYSTEMML-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2414. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Paramserv zero accuracy with Overlap_Reshuffle > -- > > Key: SYSTEMML-2414 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2414 > Project: SystemML > Issue Type: Bug >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (SYSTEMML-2412) Paramserv "all the same accuracy" problem
[ https://issues.apache.org/jira/browse/SYSTEMML-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao closed SYSTEMML-2412. --- > Paramserv "all the same accuracy" problem > - > > Key: SYSTEMML-2412 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2412 > Project: SystemML > Issue Type: Bug >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > We came across the problem that all the model accuracy are the same. One of > the suspended bug is that the batchsize in validation method is inconsistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (SYSTEMML-2403) Paramserv low accuracy sometimes occurred
[ https://issues.apache.org/jira/browse/SYSTEMML-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao closed SYSTEMML-2403. --- > Paramserv low accuracy sometimes occurred > - > > Key: SYSTEMML-2403 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2403 > Project: SystemML > Issue Type: Bug >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > We could observe that the low accuracy will sometimes occurred. Here is the > scenario: _BSP, BATCH, DISJOINT_CONTIGUOUS (or DISJOINT_RANDOM)_ with _1 > epoch, 4 workers (I have 4 vcores in my machine) and batchSize 16_ using 60k > minist. > {code:java} > Val Loss: 2.3006845853187783 > Val Accuracy: 0.11184 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2412) Paramserv "all the same accuracy" problem
[ https://issues.apache.org/jira/browse/SYSTEMML-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2412. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Paramserv "all the same accuracy" problem > - > > Key: SYSTEMML-2412 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2412 > Project: SystemML > Issue Type: Bug >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > We came across the problem that all the model accuracy are the same. One of > the suspended bug is that the batchsize in validation method is inconsistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2403) Paramserv low accuracy sometimes occurred
[ https://issues.apache.org/jira/browse/SYSTEMML-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2403. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Paramserv low accuracy sometimes occurred > - > > Key: SYSTEMML-2403 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2403 > Project: SystemML > Issue Type: Bug >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > We could observe that the low accuracy will sometimes occurred. Here is the > scenario: _BSP, BATCH, DISJOINT_CONTIGUOUS (or DISJOINT_RANDOM)_ with _1 > epoch, 4 workers (I have 4 vcores in my machine) and batchSize 16_ using 60k > minist. > {code:java} > Val Loss: 2.3006845853187783 > Val Accuracy: 0.11184 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SYSTEMML-2418) Spark data partitioner
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao resolved SYSTEMML-2418. - Resolution: Fixed Fix Version/s: SystemML 1.2 > Spark data partitioner > -- > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > In the context of ml, it would be more efficient to support the data > partitioning in a distributed manner. This task aims to do the data > partitioning on Spark, which means that all the data is first split among the > workers, the data partitioning is then executed on the worker side according > to the scheme, and the partitioned data, which stays on each worker, can be > directly passed to the model training work without materialization on HDFS. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
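As a rough illustration of this scheme (not the actual SystemML code), the sketch below shuffles rows to worker ids once and keeps the resulting shards as a co-partitioned RDD on the executors; rows are modeled as plain double[] and the DISJOINT_CONTIGUOUS scheme is assumed.
{code:java}
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class DataPartitionSketch {
  // map a row index to a worker id under DISJOINT_CONTIGUOUS
  public static int workerOf(long rowIndex, long numRows, int numWorkers) {
    long rowsPerWorker = (numRows + numWorkers - 1) / numWorkers; // ceil
    return (int) (rowIndex / rowsPerWorker);
  }

  public static JavaPairRDD<Integer, double[]> partition(
      JavaPairRDD<Long, double[]> rows, long numRows, int numWorkers) {
    return rows
      .mapToPair(r -> new Tuple2<>(workerOf(r._1(), numRows, numWorkers), r._2()))
      // co-locate each worker's shard; the shards stay on the executors
      // as an RDD, i.e., no materialization on HDFS
      .partitionBy(new HashPartitioner(numWorkers));
  }
}
{code}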
[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835 ] LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:21 AM: --- [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is about the codegen class. Aiming to avoid concurrent and redundant reloading of the class, what will the codegen class concretely be? Because when parsing the parfor body, it will generate a codegen class map. was (Author: guobao): [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is about the codegen class. Aiming to avoid concurrent reloading of the class, what will the codegen class concretely be? Because when parsing the parfor body, it will generate a codegen class map. > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of distributed spark env, we need to firstly ship the > necessary functions and variables to the remote workers and then to > initialize and register the cleanup of buffer pool for each remote worker. > All these are inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835 ] LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:21 AM: --- [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is about the codegen class. Aiming to avoid concurrent and redundant loading of the class, what will the codegen class concretely be? Because when parsing the parfor body, it will generate a codegen class map. was (Author: guobao): [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is about the codegen class. Aiming to avoid concurrent and redundant reloading of the class, what will the codegen class concretely be? Because when parsing the parfor body, it will generate a codegen class map. > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of distributed spark env, we need to firstly ship the > necessary functions and variables to the remote workers and then to > initialize and register the cleanup of buffer pool for each remote worker. > All these are inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835 ] LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:20 AM: --- [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is about the codegen class. Aiming to avoid concurrent reloading of the class, what will the codegen class concretely be? Because when parsing the parfor body, it will generate a codegen class map. was (Author: guobao): [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is: what is the codegen class concretely, and what is it used for? Because when parsing the parfor body, it will generate a codegen class map. > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of distributed spark env, we need to firstly ship the > necessary functions and variables to the remote workers and then to > initialize and register the cleanup of buffer pool for each remote worker. > All these are inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835 ] LI Guobao edited comment on SYSTEMML-2419 at 7/13/18 10:18 AM: --- [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is: what is the codegen class concretely, and what is it used for? Because when parsing the parfor body, it will generate a codegen class map. was (Author: guobao): [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is: what is the codegen class used for? Because when parsing the parfor body, it will generate a codegen class map. > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of distributed spark env, we need to firstly ship the > necessary functions and variables to the remote workers and then to > initialize and register the cleanup of buffer pool for each remote worker. > All these are inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542835#comment-16542835 ] LI Guobao commented on SYSTEMML-2419: - [~mboehm7], I have some questions. The first is about the setup of the remote parfor worker. In fact, I saw that this block of code is synchronized, and so I wonder if it means that, in one executor, the parfor task will be launched in a multi-threaded way? The second is: what is the codegen class used for? Because when parsing the parfor body, it will generate a codegen class map. > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of distributed spark env, we need to firstly ship the > necessary functions and variables to the remote workers and then to > initialize and register the cleanup of buffer pool for each remote worker. > All these are inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2299) API design of the paramserv function
[ https://issues.apache.org/jira/browse/SYSTEMML-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540715#comment-16540715 ] LI Guobao commented on SYSTEMML-2299: - [~mboehm7], in fact, it seems that until now the arguments "val_features" and "val_labels" have not been used inside the paramserv function. As I understand it, they are meant for calculating the model precision, but we do that with a UDF rather than inside our paramserv function. So do they have some other use? > API design of the paramserv function > > > Key: SYSTEMML-2299 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2299 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Fix For: SystemML 1.2 > > > The objective of the "paramserv" built-in function is to update an initial or > existing model with the given configuration. An initial function signature would be: > {code:java} > model'=paramserv(model=paramsList, features=X, labels=Y, val_features=X_val, > val_labels=Y_val, upd="fun1", agg="fun2", mode="LOCAL", utype="BSP", > freq="BATCH", epochs=100, batchsize=64, k=7, scheme="disjoint_contiguous", > hyperparams=params, checkpointing="NONE"){code} > We are interested in providing the model (which will be a struct-like data > structure consisting of the weights, the biases and the hyperparameters), the > training features and labels, the validation features and labels, the batch > update function (i.e., gradient calculation func), the update strategy (e.g. > sync, async, hogwild!, stale-synchronous), the update frequency (e.g. epoch > or mini-batch), the gradient aggregation function, the number of epochs, the > batch size, the degree of parallelism, the data partition scheme, a list of > additional hyperparameters, as well as the checkpointing strategy. And the > function will return a trained model in struct format. > *Inputs*: > * model : a list consisting of the weight and bias matrices > * features : training features matrix > * labels : training label matrix > * val_features : validation features matrix > * val_labels : validation label matrix > * upd : the name of the gradient calculation function > * agg : the name of the gradient aggregation function > * mode (options: LOCAL, REMOTE_SPARK): the execution backend where > the parameter server is executed > * utype (options: BSP, ASP, SSP): the update mode > * freq [optional] (default: BATCH) (options: EPOCH, BATCH) : the > frequency of updates > * epochs : the number of epochs > * batchsize [optional] (default: 64): the batch size; if the > update frequency is "EPOCH", this argument is ignored > * k [optional] (default: number of vcores, otherwise vcores / 2 if > using openblas): the degree of parallelism > * scheme [optional] (default: disjoint_contiguous) (options: > disjoint_contiguous, disjoint_round_robin, disjoint_random, > overlap_reshuffle): the data partition scheme, i.e., how the data is > distributed across workers > * hyperparams [optional]: a list consisting of the additional > hyperparameters, e.g., learning rate, momentum > * checkpointing [optional] (default: NONE) (options: NONE, EPOCH, > EPOCH10) : the checkpoint strategy; we could set a checkpoint for every epoch > or every 10 epochs > *Output*: > * model' : a list consisting of the updated weight and bias matrices -- This message was sent by Atlassian JIRA (v7.6.3#76005)
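For reference, a hedged usage sketch of this API from the Java MLContext bindings; the DML body in the comment mirrors the signature above, while the script file name and the function names "gradients" and "aggregation" are placeholders, not part of the proposal.
{code:java}
import org.apache.spark.sql.SparkSession;
import org.apache.sysml.api.mlcontext.MLContext;
import org.apache.sysml.api.mlcontext.Script;
import org.apache.sysml.api.mlcontext.ScriptFactory;

public class ParamservUsageSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
      .appName("paramserv-demo").getOrCreate();
    MLContext ml = new MLContext(spark);
    // train.dml (placeholder) is assumed to define gradients()/aggregation(),
    // read X and Y, build the initial model list, and call:
    //   model2 = paramserv(model=model, features=X, labels=Y,
    //     upd="gradients", agg="aggregation", mode="REMOTE_SPARK",
    //     utype="BSP", freq="BATCH", epochs=100, batchsize=64,
    //     k=7, scheme="disjoint_contiguous", hyperparams=hp)
    Script script = ScriptFactory.dmlFromFile("train.dml");
    ml.execute(script);
  }
}
{code}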
[jira] [Comment Edited] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534950#comment-16534950 ] LI Guobao edited comment on SYSTEMML-2419 at 7/6/18 3:07 PM: - [~mboehm7], I have a problem when serializing the instructions: I got some Spark instructions which could not be serialized. Hence, my question is whether we should recreate the instructions by forcing the HOPs to CP type. I would also like to know how parfor handles this case, or whether it simply does not generate SP instructions. Here is the stack: {code:java} Caused by: org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions of type other than CP instructions org.apache.sysml.runtime.instructions.spark.BinaryMatrixScalarSPInstruction SPARK°max°0·SCALAR·INT·true°_mVar1279·MATRIX·DOUBLE°_mVar1280·MATRIX·DOUBLE {code} was (Author: guobao): [~mboehm7], I have a problem when serializing the instructions: I got some Spark instructions which could not be serialized. Hence, my question is whether we should recreate the instructions by forcing the HOPs to CP type. Here is the stack: {code:java} Caused by: org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions of type other than CP instructions org.apache.sysml.runtime.instructions.spark.BinaryMatrixScalarSPInstruction SPARK°max°0·SCALAR·INT·true°_mVar1279·MATRIX·DOUBLE°_mVar1280·MATRIX·DOUBLE {code} > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of distributed spark env, we need to firstly ship the > necessary functions and variables to the remote workers and then to > initialize and register the cleanup of buffer pool for each remote worker. > All these are inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534950#comment-16534950 ] LI Guobao commented on SYSTEMML-2419: - [~mboehm7], I have a problem when serializing the instructions: I got some Spark instructions which could not be serialized. Hence, my question is whether we should recreate the instructions by forcing the HOPs to CP type. Here is the stack: {code:java} Caused by: org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions of type other than CP instructions org.apache.sysml.runtime.instructions.spark.BinaryMatrixScalarSPInstruction SPARK°max°0·SCALAR·INT·true°_mVar1279·MATRIX·DOUBLE°_mVar1280·MATRIX·DOUBLE {code} > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of distributed spark env, we need to firstly ship the > necessary functions and variables to the remote workers and then to > initialize and register the cleanup of buffer pool for each remote worker. > All these are inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
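To make the failure mode concrete, here is a hypothetical guard in the spirit of the error above: before shipping a function body to the remote workers, it verifies that only CP instructions are present (anything else would have to be recompiled with forced CP execution type). The class names are from the SystemML runtime; the guard itself is illustrative, not the actual implementation.
{code:java}
import java.util.ArrayList;
import org.apache.sysml.runtime.DMLRuntimeException;
import org.apache.sysml.runtime.controlprogram.ProgramBlock;
import org.apache.sysml.runtime.instructions.Instruction;
import org.apache.sysml.runtime.instructions.cp.CPInstruction;

public class SerializationGuard {
  /** Fail fast if a program block contains non-CP (e.g., SPARK)
   *  instructions that cannot be serialized for remote workers. */
  public static void checkCPOnly(ProgramBlock pb) {
    ArrayList<Instruction> insts = pb.getInstructions();
    for (Instruction inst : insts)
      if (!(inst instanceof CPInstruction))
        throw new DMLRuntimeException("Not supported: non-CP instruction "
          + inst.getClass().getName()
          + "; recompile the HOPs with forced CP execution type.");
  }
}
{code}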
[jira] [Updated] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2419: Description: In the context of the distributed spark env, we first need to ship the necessary functions and variables to the remote workers, and then initialize the buffer pool and register its cleanup for each remote worker. All of this is inspired by the parfor implementation. (was: In the context of distributed spark env, we need to initialize and register the cleanup of buffer pool for each remote worker. It could be inspired by the parfor implementation.) > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of the distributed spark env, we first need to ship the > necessary functions and variables to the remote workers, and then initialize > the buffer pool and register its cleanup for each remote worker. All of this > is inspired by the parfor implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
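A minimal sketch of that setup/cleanup sequence, loosely following the parfor remote-worker pattern; the buffer-pool calls (CacheableData.initCaching / cleanupCacheDir) are assumptions about the runtime API, and the shutdown-hook registration is only one possible cleanup strategy.
{code:java}
import org.apache.sysml.runtime.controlprogram.caching.CacheableData;

public class RemoteWorkerSetupSketch {
  private static boolean _setup = false;

  /** Run once per executor JVM before executing the shipped function body. */
  public static synchronized void setup(String jobUUID) throws Exception {
    if (_setup)
      return;
    CacheableData.initCaching(jobUUID); // init the local buffer pool / cache dir
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try { CacheableData.cleanupCacheDir(); } // registered cleanup
      catch (Exception e) { /* best-effort cleanup on JVM exit */ }
    }));
    _setup = true;
  }
}
{code}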
[jira] [Updated] (SYSTEMML-2418) Spark data partitioner
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2418: Description: In the context of ml, it would be more efficient to support the data partitioning in a distributed manner. This task aims to do the data partitioning on Spark, which means that all the data is first split among the workers, the data partitioning is then executed on the worker side according to the scheme, and the partitioned data, which stays on each worker, can be directly passed to the model training work without materialization on HDFS. (was: In the context of ml, it would be more efficient to support the data partitioning in distributed manner. This task aims to do the data partitioning on Spark which means that all the data will be firstly splitted among workers and then execute data partitioning on worker side according to scheme, and then the partitioned data which stay on each worker could be directly passed to run model training work.) > Spark data partitioner > -- > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of ml, it would be more efficient to support the data > partitioning in a distributed manner. This task aims to do the data > partitioning on Spark, which means that all the data is first split among the > workers, the data partitioning is then executed on the worker side according > to the scheme, and the partitioned data, which stays on each worker, can be > directly passed to the model training work without materialization on HDFS. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2418) Spark data partitioner
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2418: Description: In the context of ml, it would be more efficient to support the data partitioning in distributed manner. This task aims to do the data partitioning on Spark which means that all the data will be firstly splitted among workers and then execute data partitioning on worker side according to scheme, and then the partitioned data which stay on each worker could be directly passed to run model training work. (was: In the context of ml, the training data will be usually overfitted in spark driver node. So to partition such enormous data is no more feasible in CP. This task aims to do the data partitioning in distributed way which means that the workers will receive its split of training data and do the data partition locally according to different schemes. And then all the data will be grouped by the given key (i.e., the worker id) and at last be written into the seperate HDFS file in scratch place.) > Spark data partitioner > -- > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of ml, it would be more efficient to support the data > partitioning in distributed manner. This task aims to do the data > partitioning on Spark which means that all the data will be firstly splitted > among workers and then execute data partitioning on worker side according to > scheme, and then the partitioned data which stay on each worker could be > directly passed to run model training work. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2418) Spark data partitioner
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525299#comment-16525299 ] LI Guobao commented on SYSTEMML-2418: - [~mboehm7], is my logic correct? By the way, I'd like to know whether the scratch space is shared by all the remote workers. If so, could the workers load the file from this HDFS location? > Spark data partitioner > -- > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of ml, the training data will be usually overfitted in spark > driver node. So to partition such enormous data is no more feasible in CP. > This task aims to do the data partitioning in distributed way which means > that the workers will receive its split of training data and do the data > partition locally according to different schemes. And then all the data will > be grouped by the given key (i.e., the worker id) and at last be written into > the seperate HDFS file in scratch place. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2418) Spark data partitioner
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2418: Description: In the context of ml, the training data will be usually overfitted in spark driver node. So to partition such enormous data is no more feasible in CP. This task aims to do the data partitioning in distributed way which means that the workers will receive its split of training data and do the data partition locally according to different schemes. And then all the data will be grouped by the given key (i.e., the worker id) and at last be written into the seperate HDFS file in scratch place. (was: In the context of ml, the training data will be usually overfitted in spark driver node. So to partition such enormous data is no more feasible in CP. This task aims to do the data partitioning in distributed way which means that the workers will receive its split of training data and do the data partition locally according to different schemes. And then all the data will be grouped by the given key (i.e., the worker id) and at last be written into the seperate HDFS file.) > Spark data partitioner > -- > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of ml, the training data will be usually overfitted in spark > driver node. So to partition such enormous data is no more feasible in CP. > This task aims to do the data partitioning in distributed way which means > that the workers will receive its split of training data and do the data > partition locally according to different schemes. And then all the data will > be grouped by the given key (i.e., the worker id) and at last be written into > the seperate HDFS file in scratch place. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2418) Spark data partitioner
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2418: Description: In the context of ml, the training data will be usually overfitted in spark driver node. So to partition such enormous data is no more feasible in CP. This task aims to do the data partitioning in distributed way which means that the workers will receive its split of training data and do the data partition locally according to different schemes. And then all the data will be grouped by the given key (i.e., the worker id) and at last be written into the seperate HDFS file. (was: In the context of ps, the training data will be partitioned according to the different schemes. This conversion is executed in driver node and the partitioned data should be distributed to workers via broadcast. Due to the 2G limitation of spark broadcast, we could leverage the _PartitionedBroadcast_ class to do this conversion. Afterwards, the partitioned broadcast object can be passed to workers for launching its job.) > Spark data partitioner > -- > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of ml, the training data will be usually overfitted in spark > driver node. So to partition such enormous data is no more feasible in CP. > This task aims to do the data partitioning in distributed way which means > that the workers will receive its split of training data and do the data > partition locally according to different schemes. And then all the data will > be grouped by the given key (i.e., the worker id) and at last be written into > the seperate HDFS file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2418) Spark data partitioner
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2418: Summary: Spark data partitioner (was: Distributing data to workers) > Spark data partitioner > -- > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > In the context of ps, the training data will be partitioned according to the > different schemes. This conversion is executed in driver node and the > partitioned data should be distributed to workers via broadcast. Due to the > 2G limitation of spark broadcast, we could leverage the > _PartitionedBroadcast_ class to do this conversion. Afterwards, the > partitioned broadcast object can be passed to workers for launching its job. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2421) Task error and preemption handles
[ https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2421: Description: It aims to introduce the checkpointing to guarantee that the worker can recover from a previous failure. In detail, once a worker is brought up, it pulls the current state of the model, which consists of each worker's progress (i.e., which batch iteration and epoch is being executed). And the checkpointing could be set to EPOCH10, which means that every 10 epochs the state will be persisted in a centralized file on the server side. (was: It aims to introduce the checkpointing to guarantee that the worker could recover from previous failure. In details, once a worker is brought up it pulls the current state of the model. And the checkpointing could be set to be EPOCH10 which means that every 10 epoch the state will be persisted in centralized file on server side.) > Task error and preemption handles > - > > Key: SYSTEMML-2421 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2421 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to introduce the checkpointing to guarantee that the worker can > recover from a previous failure. In detail, once a worker is brought up, it > pulls the current state of the model, which consists of each worker's progress > (i.e., which batch iteration and epoch is being executed). And the > checkpointing could be set to EPOCH10, which means that every 10 epochs the > state will be persisted in a centralized file on the server side. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
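Sketched below (all names hypothetical, not SystemML API), the EPOCH10 strategy amounts to the server persisting the model plus per-worker progress every 10 epochs, with a restarted worker pulling that state on bring-up:
{code:java}
public class CheckpointSketch {
  static final int CHECKPOINT_EVERY = 10; // the "EPOCH10" strategy

  /** Centralized state file on the server side (hypothetical interface). */
  interface StateStore {
    void persist(int epoch, Object model, int[] workerProgress);
    Checkpoint loadLatest(); // null if no checkpoint exists yet
  }

  static class Checkpoint {
    int epoch; Object model; int[] workerProgress;
  }

  /** Called by the server at the end of each epoch. */
  static void maybeCheckpoint(StateStore store, int epoch,
      Object model, int[] workerProgress) {
    if (epoch % CHECKPOINT_EVERY == 0)
      store.persist(epoch, model, workerProgress);
  }

  /** Called by a worker on (re)start: resume from the latest checkpoint. */
  static int resumeEpoch(StateStore store) {
    Checkpoint c = store.loadLatest();
    return (c == null) ? 0 : c.epoch;
  }
}
{code}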
[jira] [Updated] (SYSTEMML-2421) Task error and preemption handles
[ https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2421: Description: It aims to introduce the checkpointing to guarantee that the worker could recover from previous failure. In details, once a worker is brought up it pulls the current state of the model. And the checkpointing could be set to be EPOCH10 which means that every 10 epoch the state will be persisted in centralized file on server side. (was: It aims to introduce the checkpointing to guarantee that the worker could recover from previous failure. In details, once a worker is brought up it pulls the current state of the model. And the checkpointing could be set to be EPOCH10 which means that every 10 epoch the state will be persisted in a file on worker side.) > Task error and preemption handles > - > > Key: SYSTEMML-2421 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2421 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to introduce the checkpointing to guarantee that the worker could > recover from previous failure. In details, once a worker is brought up it > pulls the current state of the model. And the checkpointing could be set to > be EPOCH10 which means that every 10 epoch the state will be persisted in > centralized file on server side. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2421) Task error and preemption handles
[ https://issues.apache.org/jira/browse/SYSTEMML-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2421: Description: It aims to introduce the checkpointing to guarantee that the worker could recover from previous failure. In details, once a worker is brought up it pulls the current state of the model. And the checkpointing could be set to be EPOCH10 which means that every 10 epoch the state will be persisted in a file on worker side. (was: It aims to introduce the checkpointing to guarantee that the task could recover from failure. In details, once a worker is brought up it pulls the current state of the model. And the checkpointing could be set to be EPOCH10 which means that every 10 epoch the state will be persisted in a file.) > Task error and preemption handles > - > > Key: SYSTEMML-2421 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2421 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > > It aims to introduce the checkpointing to guarantee that the worker could > recover from previous failure. In details, once a worker is brought up it > pulls the current state of the model. And the checkpointing could be set to > be EPOCH10 which means that every 10 epoch the state will be persisted in a > file on worker side. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522920#comment-16522920 ] LI Guobao commented on SYSTEMML-2420: - [~mboehm7], thanks for the feedback. I decided to take the latter option, i.e., implementing our own Rpc communication. I have uploaded some diagrams and updated the description. I'm looking forward to your review. > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > Attachments: systemml_rpc_2_seq_diagram.png, > systemml_rpc_class_diagram.png, systemml_rpc_sequence_diagram.png > > > It aims to implement the parameter exchange between ps and workers. We could > leverage the netty framework to implement our own Rpc framework. In general, > the netty {{TransportClient}} and {{TransportServer}} provide the sending and > receiving services for ps and workers. Extending the {{RpcHandler}} allows > invoking the corresponding ps method (i.e., the push/pull method) by handling > the different input Rpc call objects. The {{SparkPsProxy}} wrapping the > {{TransportClient}} then allows the workers to execute push/pull calls to the > server. At the same time, the ps netty server also provides a file repository > service which allows the workers to download their partitioned training data, > so that the workers can rebuild the matrix objects from the transferred files > instead of spark broadcasting all the files, which are not all necessary for > each worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2420: Attachment: systemml_rpc_class_diagram.png > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > Attachments: systemml_rpc_2_seq_diagram.png, systemml_rpc_class_diagram.png, systemml_rpc_sequence_diagram.png > > > It aims to implement the parameter exchange between ps and workers. We could leverage the Netty framework to implement our own RPC framework. In general, the Netty {{TransportClient}} and {{TransportServer}} provide the sending and receiving services for ps and workers. Extending {{RpcHandler}} allows invoking the corresponding ps method (i.e., the push/pull methods) by handling the different incoming RPC call objects. The {{SparkPsProxy}} wrapping the {{TransportClient}} then allows the workers to issue push/pull calls to the server. At the same time, the ps Netty server also provides a file repository service that allows the workers to download their partitioned training data, so that each worker can rebuild its matrix object from the transferred file instead of Spark broadcasting all the files, not all of which are needed by each worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2420: Description: It aims to implement the parameter exchange between ps and workers. We could leverage the Netty framework to implement our own RPC framework. In general, the Netty {{TransportClient}} and {{TransportServer}} provide the sending and receiving services for ps and workers. Extending {{RpcHandler}} allows invoking the corresponding ps method (i.e., the push/pull methods) by handling the different incoming RPC call objects. The {{SparkPsProxy}} wrapping the {{TransportClient}} then allows the workers to issue push/pull calls to the server. At the same time, the ps Netty server also provides a file repository service that allows the workers to download their partitioned training data, so that each worker can rebuild its matrix object from the transferred file instead of Spark broadcasting all the files, not all of which are needed by each worker. (was: It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, so we need to wrap it in order to use it from Java. Overall, we could register the ps service with _RpcEndpoint_ and invoke the service with _RpcEndpointRef_.) > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > Attachments: systemml_rpc_2_seq_diagram.png, systemml_rpc_sequence_diagram.png > > > It aims to implement the parameter exchange between ps and workers. We could leverage the Netty framework to implement our own RPC framework. In general, the Netty {{TransportClient}} and {{TransportServer}} provide the sending and receiving services for ps and workers. Extending {{RpcHandler}} allows invoking the corresponding ps method (i.e., the push/pull methods) by handling the different incoming RPC call objects. The {{SparkPsProxy}} wrapping the {{TransportClient}} then allows the workers to issue push/pull calls to the server. At the same time, the ps Netty server also provides a file repository service that allows the workers to download their partitioned training data, so that each worker can rebuild its matrix object from the transferred file instead of Spark broadcasting all the files, not all of which are needed by each worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2420: Attachment: systemml_rpc_sequence_diagram.png > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > Attachments: systemml_rpc_2_seq_diagram.png, systemml_rpc_sequence_diagram.png > > > It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, so we need to wrap it in order to use it from Java. Overall, we could register the ps service with _RpcEndpoint_ and invoke the service with _RpcEndpointRef_. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
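Independent of the chosen transport (Spark RPC or a hand-rolled Netty layer), the service contract between workers and ps stays small. A schematic Java interface, with illustrative names and a byte-array payload standing in for SystemML's matrix types:
{code}
/**
 * Schematic push/pull contract the registered ps endpoint would expose.
 * Names and signatures are illustrative, not Spark's Scala RpcEndpoint API.
 */
public interface PsServiceSketch {
  /** A worker sends its locally computed gradients to the server. */
  void push(int workerId, byte[] serializedGradients);

  /** A worker fetches the current (possibly updated) model parameters. */
  byte[] pull(int workerId);
}
{code}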
[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2420: Attachment: systemml_rpc_2_seq_diagram.png > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > Attachments: systemml_rpc_2_seq_diagram.png, systemml_rpc_sequence_diagram.png > > > It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, so we need to wrap it in order to use it from Java. Overall, we could register the ps service with _RpcEndpoint_ and invoke the service with _RpcEndpointRef_. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521704#comment-16521704 ] LI Guobao commented on SYSTEMML-2420: - [~mboehm7], I wrote down the idea for implementing the exchange between ps and workers. Is this idea correct? > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, so we need to wrap it in order to use it from Java. Overall, we could register the ps service with _RpcEndpoint_ and invoke the service with _RpcEndpointRef_. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2420: Description: It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, so we need to wrap it in order to use it from Java. Overall, we could register the ps service with _RpcEndpoint_ and invoke the service with _RpcEndpointRef_. (was: It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, but we could easily wrap it in a Java class for reuse. The two methods (push, pull) could then be wrapped into this defined endpoint.) > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, so we need to wrap it in order to use it from Java. Overall, we could register the ps service with _RpcEndpoint_ and invoke the service with _RpcEndpointRef_. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2420: Description: It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, but we could easily wrap it in a Java class for reuse. The two methods (push, pull) could then be wrapped into this defined endpoint. (was: It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, but we could easily wrap it in a Java class for reuse.) > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, but we could easily wrap it in a Java class for reuse. The two methods (push, pull) could then be wrapped into this defined endpoint. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2420) Communication between ps and workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2420: Description: It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, but we could easily wrap it in a Java class for reuse. (was: It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side.) > Communication between ps and workers > > > Key: SYSTEMML-2420 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. Note that Spark RPC is implemented in Scala, but we could easily wrap it in a Java class for reuse. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2424) Determine the level of par
[ https://issues.apache.org/jira/browse/SYSTEMML-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2424: Description: It aims to determine the parallelism level according to the cluster resources, i.e., the total number of vcores. (was: It aims to determine the parallelism level according to the cluster resources, i.e., the number of vcores per executor.) > Determine the level of par > -- > > Key: SYSTEMML-2424 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2424 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > It aims to determine the parallelism level according to the cluster resources, i.e., the total number of vcores. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
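A minimal sketch of how the total vcore count could be derived from the Spark configuration; {{spark.executor.cores}} is a real Spark setting, but the helper class, the default value of 1, and the externally supplied executor count are illustrative assumptions:
{code}
import org.apache.spark.SparkConf;

/** Hypothetical sketch: derive the degree of parallelism from cluster resources. */
public class ParLevelSketch {
  public static int totalVcores(SparkConf conf, int numExecutors) {
    // Cores per executor, as configured for the application.
    int coresPerExecutor = conf.getInt("spark.executor.cores", 1);
    // Total number of vcores across the cluster = parallelism level.
    return numExecutors * coresPerExecutor;
  }
}
{code}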
[jira] [Created] (SYSTEMML-2424) Determine the level of par
LI Guobao created SYSTEMML-2424: --- Summary: Determine the level of par Key: SYSTEMML-2424 URL: https://issues.apache.org/jira/browse/SYSTEMML-2424 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to determine the parallelism level according to the cluster resources, i.e., the number of vcores per executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2423) Implementation of spark ps
LI Guobao created SYSTEMML-2423: --- Summary: Implementation of spark ps Key: SYSTEMML-2423 URL: https://issues.apache.org/jira/browse/SYSTEMML-2423 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2422) Implementation of remote worker
LI Guobao created SYSTEMML-2422: --- Summary: Implementation of remote worker Key: SYSTEMML-2422 URL: https://issues.apache.org/jira/browse/SYSTEMML-2422 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2418) Distributing data to workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2418: Description: In the context of ps, the training data is partitioned according to the different schemes. This conversion is executed in the driver node, and the partitioned data then has to be distributed to the workers via broadcast. Due to the 2 GB limitation of a single Spark broadcast, we could leverage the _PartitionedBroadcast_ class to do this conversion. Afterwards, the partitioned broadcast object can be passed to each worker for launching its job. > Distributing data to workers > > > Key: SYSTEMML-2418 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2418 > Project: SystemML > Issue Type: Sub-task > Reporter: LI Guobao > Assignee: LI Guobao > Priority: Major > > In the context of ps, the training data is partitioned according to the different schemes. This conversion is executed in the driver node, and the partitioned data then has to be distributed to the workers via broadcast. Due to the 2 GB limitation of a single Spark broadcast, we could leverage the _PartitionedBroadcast_ class to do this conversion. Afterwards, the partitioned broadcast object can be passed to each worker for launching its job. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
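The idea behind a partitioned broadcast can be illustrated with a simplified Java sketch: split a large payload into chunks so that no single broadcast object exceeds the limit. This shows only the general principle, not SystemML's actual _PartitionedBroadcast_, which operates on matrix blocks:
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

/** Illustrative sketch: chunk a large payload across several broadcasts. */
public class PartitionedBroadcastSketch {
  public static List<Broadcast<byte[]>> broadcastInChunks(
      JavaSparkContext sc, byte[] data, int chunkSize) {
    List<Broadcast<byte[]>> parts = new ArrayList<>();
    for (int off = 0; off < data.length; off += chunkSize) {
      int end = Math.min(off + chunkSize, data.length);
      // Each chunk stays well below the per-broadcast size limit.
      parts.add(sc.broadcast(Arrays.copyOfRange(data, off, end)));
    }
    return parts; // workers reassemble the chunks before rebuilding the matrix
  }
}
{code}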
[jira] [Updated] (SYSTEMML-2087) Initial version of distributed spark backend
[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2087: Description: This part aims to implement the parameter server for the Spark distributed backend. (was: This part aims to implement the parameter server for the Spark distributed backend. In general, the implementation of this ps is very close to the local ps. The ps provides the push/pull services to the workers in the driver node, whereas the communication between ps and workers is done via RPC. The data then needs to be distributed to the workers according to the different data partitioning schemes. The setup and cleanup of the workers differ from the local case and need to be handled.) > Initial version of distributed spark backend > > > Key: SYSTEMML-2087 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2087 > Project: SystemML > Issue Type: Sub-task > Reporter: Matthias Boehm > Assignee: LI Guobao > Priority: Major > > This part aims to implement the parameter server for the Spark distributed backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2421) Task error and preemption handles
LI Guobao created SYSTEMML-2421: --- Summary: Task error and preemption handles Key: SYSTEMML-2421 URL: https://issues.apache.org/jira/browse/SYSTEMML-2421 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to introduce checkpointing to guarantee that a task can recover from failure. In detail, once a worker is brought up, it pulls the current state of the model. The checkpointing frequency can be set to EPOCH10, which means that every 10 epochs the state is persisted in a file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SYSTEMML-2420) Communication between ps and workers
LI Guobao created SYSTEMML-2420: --- Summary: Communication between ps and workers Key: SYSTEMML-2420 URL: https://issues.apache.org/jira/browse/SYSTEMML-2420 Project: SystemML Issue Type: Sub-task Reporter: LI Guobao Assignee: LI Guobao It aims to implement the parameter exchange between ps and workers. We could leverage Spark RPC to set up a ps endpoint in the driver node, which means that the ps service can be discovered by the workers in the network. The workers could then invoke the push/pull methods via RPC using the registered endpoint of the ps service. Hence, in detail, this task consists of registering the ps endpoint in the Spark RPC framework and using RPC to invoke the target method from the worker side. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SYSTEMML-2419) Setup and cleanup of remote workers
[ https://issues.apache.org/jira/browse/SYSTEMML-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LI Guobao updated SYSTEMML-2419: Summary: Setup and cleanup of remote workers (was: Setup of remote workers) > Setup and cleanup of remote workers > --- > > Key: SYSTEMML-2419 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2419 > Project: SystemML > Issue Type: Sub-task >Reporter: LI Guobao >Assignee: LI Guobao >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)