This might indeed be a robustness issue of rmm, which is a replication-based matrix multiply operator. I'll have a look. In the meantime, you can increase your driver memory (you currently run with a 1GB driver, resulting in a 700MB local memory budget) to something like 10GB. This would allow a broadcast-based matrix multiply operator, as the broadcast creation requires twice the memory of a matrix (in your case, 2.8GB).
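For reference, the memory figures quoted above can be reproduced from the matrix dimensions in the runtime plan below (18750 x 18750 dense doubles). This is a back-of-the-envelope sketch; the ~70% heap fraction for the local memory budget is inferred from the "1GB driver -> 700MB budget" statement above, not quoted from the SystemML source.

```python
# Back-of-the-envelope memory math for the scenario in this thread.
# Assumptions: dense double cells (8 bytes each) and a ~70% heap
# fraction for the local memory budget (inferred, not an exact constant).

ROWS, COLS = 18750, 18750          # dimensions from the runtime plan
BYTES_PER_DOUBLE = 8

matrix_gb = ROWS * COLS * BYTES_PER_DOUBLE / 1e9
print(f"dense matrix size: {matrix_gb:.2f} GB")           # ~2.81 GB

driver_heap_gb = 1.0
budget_mb = 0.7 * driver_heap_gb * 1000
print(f"local memory budget: {budget_mb:.0f} MB")         # ~700 MB

# Broadcast creation needs roughly twice the in-memory matrix size
# (the ~2.8 GB matrix here), which is why a 1GB driver cannot take
# the broadcast-based plan but a 10GB driver can.
print(f"broadcast requirement: {2 * matrix_gb:.2f} GB")
```

Note that 18750 * 18750 = 351,562,500 cells, matching the nnz estimate printed in the createvar instructions of the runtime plan.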
Regards,
Matthias

On Fri, Jun 16, 2017 at 11:52 AM, Anthony Thomas <ahtho...@eng.ucsd.edu> wrote:

> Hi Matthias and Glenn,
>
> Unfortunately I'm still running into this problem with executors crashing
> due to OOM. Here's the runtime plan generated by SystemML:
>
> 17/06/16 18:36:03 INFO DMLScript: EXPLAIN (RUNTIME):
> # Memory Budget local/remote = 628MB/42600MB/63900MB/3195MB
> # Degree of Parallelism (vcores) local/remote = 8/24
> PROGRAM ( size CP/SP = 17/4 )
> --MAIN PROGRAM
> ----GENERIC (lines 1-4) [recompile=true]
> ------CP createvar pREADM /scratch/M5.csv false MATRIX csv 18750 18750 -1 -1 351562500 copy false , true 0.0
> ------CP createvar _mVar1 scratch_space//_p15260_172.31.3.116//_t0/temp0 true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy
> ------SPARK csvrblk pREADM.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE 1000 1000 false , true 0.0
> ------CP createvar _mVar2 scratch_space//_p15260_172.31.3.116//_t0/temp1 true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy
> ------SPARK chkpoint _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE MEMORY_AND_DISK
> ------CP rmvar _mVar1
> ------CP createvar _mVar3 scratch_space//_p15260_172.31.3.116//_t0/temp2 true MATRIX binaryblock 18750 18750 1000 1000 -1 copy
> ------SPARK rmm _mVar2.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE _mVar3.MATRIX.DOUBLE
> ------CP rmvar _mVar2
> ------CP cpvar _mVar3 MN
> ------CP rmvar _mVar3
> ----GENERIC (lines 7-7) [recompile=true]
> ------CP createvar _mVar4 scratch_space//_p15260_172.31.3.116//_t0/temp3 true MATRIX binaryblock 1 1 1000 1000 -1 copy
> ------SPARK rangeReIndex MN.MATRIX.DOUBLE 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar4.MATRIX.DOUBLE NONE
> ------CP castdts _mVar4.MATRIX.DOUBLE.false _Var5.SCALAR.STRING
> ------CP rmvar _mVar4
> ------CP print _Var5.SCALAR.STRING.false _Var6.SCALAR.STRING
> ------CP rmvar _Var5
> ------CP rmvar _Var6
> ------CP rmvar MN
> ----GENERIC (lines 9-9) [recompile=false]
> ------CP print DONE!.SCALAR.STRING.true _Var7.SCALAR.STRING
> ------CP rmvar _Var7
>
> The actual error reported by the executor is:
>
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 481296384 bytes for
> committing reserved memory.
>
> I can send my Spark and YARN configurations as well if that would be
> useful. Thanks a lot for your help.
>
> Best,
>
> Anthony
>
> On Thu, Jun 15, 2017 at 3:00 PM, Anthony Thomas <ahtho...@eng.ucsd.edu> wrote:
>
> > Thanks Matthias and Glenn,
> >
> > I'll give these suggestions a try once I get back in the office tomorrow.
> >
> > Best,
> >
> > Anthony
> >
> > On Jun 15, 2017 12:36 PM, "Matthias Boehm" <mboe...@googlemail.com> wrote:
> >
> > Well, I think Anthony already used -exec spark here; I would recommend to
> > (1) fix the driver configuration via --driver-java-options "-Xmn2500m"
> > (we assume that the young generation does not exceed 10% of the max heap
> > configuration) - this will help if the OOM comes from the driver, and (2)
> > potentially increase the memory overhead of the executors (--conf
> > spark.yarn.executor.memoryOverhead=10240) if run on YARN and the node
> > manager kills the executor processes because they exceed the container
> > limits. If this does not help, please provide the -explain output and we
> > will have a closer look.
> >
> > Regards,
> > Matthias
> >
> > On Thu, Jun 15, 2017 at 10:15 AM, Glenn Weidner <gweid...@us.ibm.com> wrote:
> >
> > > Hi Anthony,
> > >
> > > Could you retry your scenario without the '-exec spark' option? By
> > > default, SystemML will run in hybrid_spark mode which is more efficient.
> > > Thanks,
> > > Glenn
> > >
> > > From: Anthony Thomas <ahtho...@eng.ucsd.edu>
> > > To: dev@systemml.apache.org
> > > Date: 06/15/2017 09:50 AM
> > > Subject: Unexpected Executor Crash
> > >
> > > Hi SystemML Developers,
> > >
> > > I'm running the following simple DML script under SystemML 0.14:
> > >
> > > M = read('/scratch/M5.csv')
> > > N = read('/scratch/M5.csv')
> > > MN = M %*% N
> > > if (1 == 1) {
> > >     print(as.scalar(MN[1,1]))
> > > }
> > >
> > > The matrix M is square and about 5GB on disk (stored in HDFS). I am
> > > submitting the script to a 2-node Spark cluster where each physical
> > > machine has 30GB of RAM. I am using the following command to submit the job:
> > >
> > > $SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G
> > > --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128
> > > --verbose --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> > > $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain runtime
> > >
> > > However, I consistently run into errors like:
> > >
> > > ERROR TaskSchedulerImpl: Lost executor 1 on 172.31.3.116: Remote RPC client
> > > disassociated. Likely due to containers exceeding thresholds, or network
> > > issues. Check driver logs for WARN messages.
> > >
> > > and the job eventually aborts. Consulting the output of the executors shows
> > > they are crashing with OutOfMemory exceptions. Even if one executor needed
> > > to store M, N, and MN in memory simultaneously, it seems like there should
> > > be enough memory, so I'm unsure why the executor is crashing.
> > > In addition, I was under the impression that Spark would spill to disk
> > > if there was insufficient memory. I've tried various combinations of
> > > increasing/decreasing the number of executor cores (from 1 to 8), using
> > > more/fewer executors, increasing/decreasing Spark's memoryFraction, and
> > > increasing/decreasing Spark's default parallelism, all without success.
> > > Can anyone offer any advice or suggestions to debug this issue further?
> > > I'm not a very experienced Spark user so it's very possible I haven't
> > > configured something correctly. Please let me know if you'd like any
> > > further information.
> > >
> > > Best,
> > >
> > > Anthony Thomas
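Pulling together the suggestions made in this thread, a revised submission might look as follows. This is a sketch, not a tuned configuration: the 10G driver heap follows the broadcast-based-multiply suggestion, the -Xmn and memoryOverhead settings come from Matthias's earlier message, and -exec spark is dropped per Glenn's suggestion so SystemML runs in its default hybrid_spark mode.

```shell
# Illustrative re-submission combining the thread's suggestions
# (sizes are taken from the messages above, not tuned values):
$SPARK_HOME/bin/spark-submit \
  --driver-memory 10G \
  --executor-memory 25G \
  --driver-java-options "-Xmn2500m" \
  --conf spark.yarn.executor.memoryOverhead=10240 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  $SYSTEMML_HOME/SystemML.jar -f example.dml -explain runtime
```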