[ https://issues.apache.org/jira/browse/SYSTEMML-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513646#comment-16513646 ]
LI Guobao commented on SYSTEMML-2397: ------------------------------------- In fact, I think the problem is that when assigning the level of par for each multi-threaded hop, would be correct to assign k only for the leaf hop and others 1 instead of assigning k for all hops? > Paramserv ASP failing w/ OOM (too many threads) > ----------------------------------------------- > > Key: SYSTEMML-2397 > URL: https://issues.apache.org/jira/browse/SYSTEMML-2397 > Project: SystemML > Issue Type: Bug > Reporter: Matthias Boehm > Priority: Major > > Paramserv ASP with 2 epochs, 80 workers, update per EPOCH failing due to OOM > despite 200GB max heap. [~Guobao] could you please have a look? I suspect > that the degree of parallelism of instructions is not set correctly leading > to 80x80 concurrent threads. The easiest way to debug would be to use > {{Explain.explain}} to the worker instructions and check that every > instruction has an assigned degree of parallelism of 1. > {code} > 2018-06-14 22:31:16 ERROR DMLScript:543 - Failed to execute DML script. > org.apache.sysml.runtime.DMLRuntimeException: > org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program > block generated from statement block between lines 0 and 71 -- Error > evaluating instruction: > CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN > at > org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123) > at > org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:100) > at org.apache.sysml.api.DMLScript.execute(DMLScript.java:746) > at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:517) > at org.apache.sysml.api.DMLScript.main(DMLScript.java:248) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error > in program block generated from statement block between lines 0 and 71 -- > Error evaluating instruction: > CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:282) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161) > at > org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116) > ... 14 more > Caused by: org.apache.sysml.runtime.DMLRuntimeException: > ParamservBuiltinCPInstruction: some error occurred: > at > org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:163) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252) > ... 17 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.OutOfMemoryError: unable to create new native thread > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:158) > ... 18 more > Caused by: java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:238) > at > org.apache.sysml.runtime.util.CommonThreadPool.invokeAll(CommonThreadPool.java:76) > at > org.apache.sysml.runtime.matrix.data.LibMatrixDNN.execute(LibMatrixDNN.java:755) > at > org.apache.sysml.runtime.matrix.data.LibMatrixDNN.reluBackward(LibMatrixDNN.java:284) > at > org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processReluBackwardInstruction(ConvolutionCPInstruction.java:298) > at > org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processInstruction(ConvolutionCPInstruction.java:465) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210) > at > org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161) > at > org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:116) > at > org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:152) > at > org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeGradients(LocalPSWorker.java:170) > at > org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeEpoch(LocalPSWorker.java:79) > at > org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:58) > at > org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:35) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)