[ https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566118#comment-16566118 ]
Matthias Boehm commented on SYSTEMML-2478:
------------------------------------------

Well, first of all, we're not executing MR but Spark instructions here. Second, yes, there seems to be an issue, but I have not been able to reproduce it yet because (even after fixing the order of model entries to allow indexed access) there are still some incorrect lookups that ultimately result in dimension mismatches on aggregation with ADAM. So let's use the sequential aggregation for now, and I will come back to this later.

> Overhead when using parfor in update func
> -----------------------------------------
>
> Key: SYSTEMML-2478
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
> Project: SystemML
> Issue Type: Bug
> Reporter: LI Guobao
> Priority: Major
>
> When using parfor inside the update function, some MR tasks are launched to
> write the task output, and the paramserv run takes more time to finish than
> without parfor in the update function. The scenario is to launch the ASP
> Epoch DC Spark paramserv test.
> Here is the stack:
> {code:java}
> Total elapsed time: 101.804 sec.
> Total compilation time: 3.690 sec.
> Total execution time: 98.114 sec.
> Number of compiled Spark inst: 302.
> Number of executed Spark inst: 540.
> Cache hits (Mem, WB, FS, HDFS): 57839/0/0/240.
> Cache writes (WB, FS, HDFS): 14567/58/61.
> Cache times (ACQr/m, RLS, EXP): 42.346/0.064/4.761/20.280 sec.
> HOP DAGs recompiled (PRED, SB): 0/144.
> HOP DAGs recompile time: 0.507 sec.
> Functions recompiled: 16.
> Functions recompile time: 0.064 sec.
> Spark ctx create time (lazy): 1.376 sec.
> Spark trans counts (par,bc,col): 270/1/240.
> Spark trans times (par,bc,col): 0.573/0.197/42.255 secs.
> Paramserv total num workers: 3.
> Paramserv setup time: 1.559 secs.
> Paramserv grad compute time: 105.701 secs.
> Paramserv model update time: 56.801/47.193 secs.
> Paramserv model broadcast time: 23.872 secs.
> Paramserv batch slice time: 0.000 secs.
> Paramserv RPC request time: 105.159 secs.
> ParFor loops optimized: 1.
> ParFor optimize time: 0.040 sec.
> ParFor initialize time: 0.434 sec.
> ParFor result merge time: 0.005 sec.
> ParFor total update in-place: 0/7/7
> Total JIT compile time: 68.384 sec.
> Total JVM GC count: 1120.
> Total JVM GC time: 22.338 sec.
> Heavy hitter instructions:
> #   Instruction             Time(s)   Count
> 1   paramserv                97.221       1
> 2   conv2d_bias_add          60.581     614
> 3   *                        54.990   12447
> 4   sp_-                     20.625     240
> 5   -                        17.979    7287
> 6   +                        14.191   12824
> 7   r'                        5.636    1200
> 8   conv2d_backward_filter    5.123     600
> 9   max                       4.985     907
> 10  ba+*                      4.591    1814
> {code}
> Here is the polished update func:
> {code:java}
> aggregation = function(list[unknown] model,
>                        list[unknown] gradients,
>                        list[unknown] hyperparams)
>   return (list[unknown] modelResult)
> {
>   lr = as.double(as.scalar(hyperparams["lr"]))
>   mu = as.double(as.scalar(hyperparams["mu"]))
>   modelResult = model
>   # Optimize with SGD w/ Nesterov momentum
>   parfor(i in 1:8, check=0) {
>     P = as.matrix(model[i])
>     dP = as.matrix(gradients[i])
>     vP = as.matrix(model[8+i])
>     [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
>     modelResult[i] = P
>     modelResult[8+i] = vP
>   }
> }
> {code}
> [~mboehm7], in fact, I have no idea where this comes from. It seems that the
> parfor task output is written to HDFS. Is this the normal behavior?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
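For context, the per-matrix step that the update function above delegates to `sgd_nesterov::update` is SGD with Nesterov momentum: a velocity update followed by a look-ahead position correction. Below is a minimal plain-Python sketch of that step on a single scalar parameter, assuming the standard Nesterov formulation (the actual SystemML `nn/optim/sgd_nesterov` implementation, which operates on matrices, is not shown in this issue, so treat this as an illustrative assumption, not the library code):

```python
def nesterov_update(p, dp, lr, mu, v):
    """One SGD-with-Nesterov-momentum step for a single parameter value.

    p  : current parameter value
    dp : gradient w.r.t. p
    lr : learning rate
    mu : momentum coefficient
    v  : current velocity
    """
    v_prev = v                           # remember the old velocity
    v = mu * v - lr * dp                 # velocity update with the gradient
    p = p - mu * v_prev + (1 + mu) * v   # position update with look-ahead correction
    return p, v
```

Note that with `mu = 0` this degenerates to plain SGD (`p - lr * dp`), which is a quick sanity check on any implementation of the step.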