[ https://issues.apache.org/jira/browse/SYSTEMML-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871492#comment-15871492 ]
Matthias Boehm commented on SYSTEMML-1243:
------------------------------------------

Just to clarify: our default 100k scenario with 100 features runs fine. The scenario here, however, uses 1000 features, which makes stratstats more challenging. I was able to reproduce this OOM, even after the recent changes, which already reduce memory pressure.

The core problem comes from several matrix multiplications of the following form, for which we've chosen mapmm (with repartitioning at the runtime level in order to overcome Spark's 2GB limitation per partition):

{code}
mapmm: rdd [100000 x 1000, nnz=95000819, blocks (1000 x 1000)] 800MB
mapmm: bc  [1000 x 1000000, nnz=1000000, blocks (1000 x 1000)] 172MB
--> output: 100000 x 1000000
{code}

However, because the RDD has only 100 blocks, this gives us an upper bound on the maximum number of input partitions, hindering us from repartitioning this RDD to our preferred number of partitions, which in turn causes overly large outputs per task (partition).

I can think of three potential directions going forward:
1) Flip the RDD and broadcast at runtime if we detect that it would be beneficial for repartitioning (in this case raising the upper bound by 10x).
2) Alternative matrix multiplication operations: traditionally, we would have applied RMM for these scenarios, but its replication can similarly lead to large task outputs. Alternatively, we could consider enabling pmapmm for production use.
3) Extended permutation matrix multiply (pmm): so far, we only support selection but not permutation matrices, and we can only detect this within a DAG, which would not apply here. One option would be to keep track of operations producing such special matrices and flag the intermediates.
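As a back-of-envelope illustration of the partition bound described above (this is not SystemML code; the 1000x1000 block size is taken from the log, and a fully dense double-precision output is an assumption):

```python
import math

BLOCK = 1000        # SystemML block size (rows = cols = 1000), per the log above
DOUBLE_BYTES = 8    # assumption: dense 8-byte doubles in the output

def num_blocks(rows, cols, blk=BLOCK):
    """Number of square blocks covering a rows x cols matrix."""
    return math.ceil(rows / blk) * math.ceil(cols / blk)

# RDD input: 100000 x 1000 -> 100 x 1 block grid = 100 blocks.
rdd_blocks = num_blocks(100_000, 1_000)

# Output: 100000 x 1000000 -> 100 x 1000 block grid = 100000 blocks.
out_blocks = num_blocks(100_000, 1_000_000)

# With mapmm, each task multiplies its local input blocks against the
# broadcast, so the task count is capped by the number of input blocks
# (here 100), no matter how many partitions we would prefer.
max_partitions = rdd_blocks
blocks_per_task = out_blocks // max_partitions
bytes_per_task = blocks_per_task * BLOCK * BLOCK * DOUBLE_BYTES

print(rdd_blocks)               # 100 input blocks = max partitions
print(blocks_per_task)          # 1000 output blocks per task
print(bytes_per_task / 2**30)   # ~7.45 GiB per task if the output were dense
```

Even at the sparsity shown in the log, each of the 100 tasks owns 1/100 of a 100000 x 1000000 output, which is why the per-partition output blows past Spark's 2GB limit without further repartitioning.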
> Perftest: OutOfMemoryError in stratstats.dml for 800MB case
> -----------------------------------------------------------
>
>                 Key: SYSTEMML-1243
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1243
>             Project: SystemML
>          Issue Type: Bug
>          Components: Test
>    Affects Versions: SystemML 0.13
>         Environment: spark 2.1.0
>            Reporter: Imran Younus
>            Assignee: Matthias Boehm
>             Fix For: SystemML 0.13
>
>         Attachments: sparkDML.sh
>
>
> when running {{runAllStats.sh}} script, {{stratstats.dml}} ends with OutOfMemory error for 100k_1k data set. Here is end of log file:
> {code}
> 17/02/06 16:09:25 INFO api.DMLScript: SystemML Statistics:
> Total elapsed time: 1435.880 sec.
> Total compilation time: 2.433 sec.
> Total execution time: 1433.447 sec.
> Number of compiled Spark inst: 190.
> Number of executed Spark inst: 3.
> Cache hits (Mem, WB, FS, HDFS): 72343/3/4/7.
> Cache writes (WB, FS, HDFS): 10419/5/0.
> Cache times (ACQr/m, RLS, EXP): 387.598/0.039/277.658/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/107.
> HOP DAGs recompile time: 0.207 sec.
> Functions recompiled: 3.
> Functions recompile time: 0.026 sec.
> Spark ctx create time (lazy): 36.537 sec.
> Spark trans counts (par,bc,col): 3/3/0.
> Spark trans times (par,bc,col): 0.404/0.147/0.000 secs.
> Total JIT compile time: 63.262 sec.
> Total JVM GC count: 57.
> Total JVM GC time: 34.538 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) wdivmm 1078.568 sec 5
> -- 2) ba+* 286.854 sec 22
> -- 3) sp_mapmm 37.244 sec 3
> -- 4) fStat_tailprob 2.071 sec 3
> -- 5) rangeReIndex 1.608 sec 30601
> -- 6) == 0.974 sec 11
> -- 7) ^2 0.793 sec 13
> -- 8) cdf 0.603 sec 10200
> -- 9) replace 0.349 sec 10
> -- 10) r' 0.278 sec 106
> 17/02/06 16:09:25 INFO api.DMLScript: END DML run 02/06/2017 16:09:25
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:363)
> at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:339)
> at org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseOrSparseBlock(MatrixBlock.java:346)
> at org.apache.sysml.runtime.matrix.data.LibMatrixMult.matrixMultWDivMM(LibMatrixMult.java:752)
> at org.apache.sysml.runtime.matrix.data.MatrixBlock.quaternaryOperations(MatrixBlock.java:5475)
> at org.apache.sysml.runtime.instructions.cp.QuaternaryCPInstruction.processInstruction(QuaternaryCPInstruction.java:128)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:290)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:221)
> at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:168)
> at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
> at org.apache.sysml.api.DMLScript.execute(DMLScript.java:684)
> at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:360)
> at org.apache.sysml.api.DMLScript.main(DMLScript.java:221)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 17/02/06 16:09:27 INFO util.ShutdownHookManager: Shutdown hook called
> 17/02/06 16:09:27 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-6ca71fe9-1f44-4aa9-b57f-83e8ea0b1a33
> {code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346)