[ 
https://issues.apache.org/jira/browse/SYSTEMML-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871492#comment-15871492
 ] 

Matthias Boehm commented on SYSTEMML-1243:
------------------------------------------

Just to clarify: our default 100k scenario with 100 features runs fine. The 
scenario here, however, uses 1000 features, which makes stratstats more 
challenging. I was able to reproduce this OOM even after the recent changes, 
which already reduce memory pressure. 

The core problem comes from several matrix multiplications of the following 
form, for which we've chosen mapmm (with repartitioning at runtime level in 
order to overcome Spark's 2GB limitation per partition).  
{code}
mapmm: rdd [100000 x 1000, nnz=95000819, blocks (1000 x 1000)] 800MB
mapmm: bc [1000 x 1000000, nnz=1000000, blocks (1000 x 1000)] 172MB
--> output: 100000 x 1000000
{code}

However, because the RDD has only 100 blocks, this gives us an upper bound on 
the maximum number of input partitions, preventing us from repartitioning this 
RDD to our preferred number of partitions, which in turn causes too-large 
outputs per task (partition). 
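As a back-of-envelope illustration (my own sketch, not SystemML code), the block counts from the mapmm above show why the 100-block RDD caps the partition count and inflates per-task output well past Spark's 2GB partition limit:

```python
BLOCK = 1000  # SystemML block size (BLOCK x BLOCK cells per block)

def num_blocks(rows, cols, blk=BLOCK):
    """Number of (blk x blk) blocks needed to tile a rows x cols matrix."""
    return ((rows + blk - 1) // blk) * ((cols + blk - 1) // blk)

rdd_blocks = num_blocks(100_000, 1_000)    # the 800MB input RDD
bc_blocks  = num_blocks(1_000, 1_000_000)  # the broadcast side
out_blocks = num_blocks(100_000, 1_000_000)

# With at most rdd_blocks input partitions, each task must emit at least:
min_out_blocks_per_task = out_blocks // rdd_blocks
dense_block_bytes = BLOCK * BLOCK * 8  # doubles, ignoring sparsity
min_out_bytes_per_task = min_out_blocks_per_task * dense_block_bytes

print(rdd_blocks)                     # 100 -> upper bound on input partitions
print(min_out_blocks_per_task)        # 1000 output blocks per task
print(min_out_bytes_per_task / 1e9)   # 8.0 -> ~8 GB of dense output per task
```

Even with sparse outputs this stays far above the 2GB-per-partition ceiling, which matches the observed failure.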

I can think of three potential directions going forward:

1) Flip RDD and broadcast at runtime if we detect that it would be 
beneficial for repartitioning (in this case raising the upper bound by 10x)
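A hypothetical sketch of this flip (same block arithmetic as above, not actual runtime logic): treating the 1000 x 1000000 matrix as the RDD instead of the broadcast raises the block count, and hence the partition upper bound, by 10x, shrinking each task's output accordingly:

```python
BLOCK = 1000

def num_blocks(rows, cols, blk=BLOCK):
    # number of (blk x blk) blocks tiling a rows x cols matrix
    return ((rows + blk - 1) // blk) * ((cols + blk - 1) // blk)

orig_rdd_blocks    = num_blocks(100_000, 1_000)    # current RDD side
flipped_rdd_blocks = num_blocks(1_000, 1_000_000)  # broadcast side, if flipped
out_blocks         = num_blocks(100_000, 1_000_000)

dense_block_mb = BLOCK * BLOCK * 8 / 1e6  # 8 MB per dense block

# Minimum output volume per task, before vs. after the flip:
before_mb = out_blocks // orig_rdd_blocks * dense_block_mb
after_mb  = out_blocks // flipped_rdd_blocks * dense_block_mb

print(flipped_rdd_blocks // orig_rdd_blocks)  # 10 -> 10x higher bound
print(before_mb, after_mb)                    # 8000.0 vs 800.0 MB per task
```

Under these dense-output assumptions, the flip alone brings per-task output back under the 2GB partition limit.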

2) Alternative matrix multiplication operations: Traditionally, we would have 
applied RMM for these scenarios, but its replication can similarly lead to 
large task outputs. Alternatively, we could consider enabling pmapmm for 
production use. 
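A rough back-of-envelope on the RMM caveat (assuming, for illustration, that RMM replicates each left-hand block once per column block of the right-hand operand, and each right-hand block once per row block of the left):

```python
BLOCK = 1000
left_rows, left_cols   = 100_000, 1_000     # 800 MB RDD side
right_rows, right_cols = 1_000, 1_000_000   # 172 MB side

right_col_blocks = (right_cols + BLOCK - 1) // BLOCK  # 1000
left_row_blocks  = (left_rows + BLOCK - 1) // BLOCK   # 100

# Replicated shuffle volume under these assumptions, ignoring compression:
left_shuffle_gb  = 800 * right_col_blocks / 1024  # left replicated 1000x
right_shuffle_gb = 172 * left_row_blocks / 1024   # right replicated 100x
print(round(left_shuffle_gb), round(right_shuffle_gb))  # ~781 GB and ~17 GB
```

Even if the exact replication scheme differs, the left-hand side's replication factor dominates here, which is why RMM does not obviously help.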

3) Extended permutation matrix multiply pmm: So far, we only support selection 
matrices but not general permutation matrices, and we're only able to detect 
this within a single DAG, which would not apply here. One option would be to 
keep track of operations that produce such special matrices and flag their 
intermediates. 

> Perftest: OutOfMemoryError in stratstats.dml for 800MB case
> -----------------------------------------------------------
>
>                 Key: SYSTEMML-1243
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1243
>             Project: SystemML
>          Issue Type: Bug
>          Components: Test
>    Affects Versions: SystemML 0.13
>         Environment: spark 2.1.0
>            Reporter: Imran Younus
>            Assignee: Matthias Boehm
>             Fix For: SystemML 0.13
>
>         Attachments: sparkDML.sh
>
>
> when running {{runAllStats.sh}} script, {{stratstats.dml}} ends with 
> OutOfMemory error for 100k_1k data set. Here is end of log file:
> {code}
> 17/02/06 16:09:25 INFO api.DMLScript: SystemML Statistics:
> Total elapsed time:           1435.880 sec.
> Total compilation time:               2.433 sec.
> Total execution time:         1433.447 sec.
> Number of compiled Spark inst:        190.
> Number of executed Spark inst:        3.
> Cache hits (Mem, WB, FS, HDFS):       72343/3/4/7.
> Cache writes (WB, FS, HDFS):  10419/5/0.
> Cache times (ACQr/m, RLS, EXP):       387.598/0.039/277.658/0.000 sec.
> HOP DAGs recompiled (PRED, SB):       0/107.
> HOP DAGs recompile time:      0.207 sec.
> Functions recompiled:         3.
> Functions recompile time:     0.026 sec.
> Spark ctx create time (lazy): 36.537 sec.
> Spark trans counts (par,bc,col):3/3/0.
> Spark trans times (par,bc,col):       0.404/0.147/0.000 secs.
> Total JIT compile time:               63.262 sec.
> Total JVM GC count:           57.
> Total JVM GC time:            34.538 sec.
> Heavy hitter instructions (name, time, count):
> -- 1)         wdivmm  1078.568 sec    5
> -- 2)         ba+*    286.854 sec     22
> -- 3)         sp_mapmm        37.244 sec      3
> -- 4)         fStat_tailprob  2.071 sec       3
> -- 5)         rangeReIndex    1.608 sec       30601
> -- 6)         ==      0.974 sec       11
> -- 7)         ^2      0.793 sec       13
> -- 8)         cdf     0.603 sec       10200
> -- 9)         replace         0.349 sec       10
> -- 10)        r'      0.278 sec       106
> 17/02/06 16:09:25 INFO api.DMLScript: END DML run 02/06/2017 16:09:25
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>       at 
> org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:363)
>       at 
> org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseBlock(MatrixBlock.java:339)
>       at 
> org.apache.sysml.runtime.matrix.data.MatrixBlock.allocateDenseOrSparseBlock(MatrixBlock.java:346)
>       at 
> org.apache.sysml.runtime.matrix.data.LibMatrixMult.matrixMultWDivMM(LibMatrixMult.java:752)
>       at 
> org.apache.sysml.runtime.matrix.data.MatrixBlock.quaternaryOperations(MatrixBlock.java:5475)
>       at 
> org.apache.sysml.runtime.instructions.cp.QuaternaryCPInstruction.processInstruction(QuaternaryCPInstruction.java:128)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:290)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:221)
>       at 
> org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:168)
>       at 
> org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
>       at org.apache.sysml.api.DMLScript.execute(DMLScript.java:684)
>       at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:360)
>       at org.apache.sysml.api.DMLScript.main(DMLScript.java:221)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>       at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 17/02/06 16:09:27 INFO util.ShutdownHookManager: Shutdown hook called
> 17/02/06 16:09:27 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-6ca71fe9-1f44-4aa9-b57f-83e8ea0b1a33
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)