Hi Matthias and Glenn,

Unfortunately I'm still running into this problem with executors crashing
due to OOM. Here's the runtime plan generated by SystemML:

17/06/16 18:36:03 INFO DMLScript: EXPLAIN (RUNTIME):

# Memory Budget local/remote = 628MB/42600MB/63900MB/3195MB

# Degree of Parallelism (vcores) local/remote = 8/24

PROGRAM ( size CP/SP = 17/4 )

--MAIN PROGRAM

----GENERIC (lines 1-4) [recompile=true]

------CP createvar pREADM /scratch/M5.csv false MATRIX csv 18750 18750 -1
-1 351562500 copy false , true 0.0

------CP createvar _mVar1 scratch_space//_p15260_172.31.3.116//_t0/temp0
true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy

------SPARK csvrblk pREADM.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE 1000 1000
false , true 0.0

------CP createvar _mVar2 scratch_space//_p15260_172.31.3.116//_t0/temp1
true MATRIX binaryblock 18750 18750 1000 1000 351562500 copy

------SPARK chkpoint _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE
MEMORY_AND_DISK

------CP rmvar _mVar1

------CP createvar _mVar3 scratch_space//_p15260_172.31.3.116//_t0/temp2
true MATRIX binaryblock 18750 18750 1000 1000 -1 copy

------SPARK rmm _mVar2.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE
_mVar3.MATRIX.DOUBLE

------CP rmvar _mVar2

------CP cpvar _mVar3 MN

------CP rmvar _mVar3

----GENERIC (lines 7-7) [recompile=true]

------CP createvar _mVar4 scratch_space//_p15260_172.31.3.116//_t0/temp3
true MATRIX binaryblock 1 1 1000 1000 -1 copy

------SPARK rangeReIndex MN.MATRIX.DOUBLE 1.SCALAR.INT.true
1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar4.MATRIX.DOUBLE
NONE

------CP castdts _mVar4.MATRIX.DOUBLE.false _Var5.SCALAR.STRING

------CP rmvar _mVar4

------CP print _Var5.SCALAR.STRING.false _Var6.SCALAR.STRING

------CP rmvar _Var5

------CP rmvar _Var6

------CP rmvar MN

----GENERIC (lines 9-9) [recompile=false]

------CP print DONE!.SCALAR.STRING.true _Var7.SCALAR.STRING

------CP rmvar _Var7


The actual error reported by the executor is:


# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 481296384 bytes for
committing reserved memory.


I can send my Spark and YARN configurations as well if that would be
useful. Thanks a lot for your help.
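
For reference, the memory-related settings I'd pull out of spark-defaults.conf
and yarn-site.xml are along these lines (the values below are placeholders for
illustration, not my actual configuration):

    # spark-defaults.conf (placeholder values)
    spark.driver.memory                  5g
    spark.executor.memory                25g
    spark.yarn.executor.memoryOverhead   2048

    # yarn-site.xml, shown as key = value for brevity (placeholder values)
    yarn.nodemanager.resource.memory-mb  = 28672
    yarn.scheduler.maximum-allocation-mb = 28672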


Best,


Anthony

On Thu, Jun 15, 2017 at 3:00 PM, Anthony Thomas <ahtho...@eng.ucsd.edu>
wrote:

> Thanks Matthias and Glenn,
>
> I'll give these suggestions a try once I get back in the office tomorrow.
>
> Best,
>
> Anthony
>
>
> On Jun 15, 2017 12:36 PM, "Matthias Boehm" <mboe...@googlemail.com> wrote:
>
> well, I think Anthony already used -exec spark here; I would recommend (1)
> fixing the driver configuration via --driver-java-options "-Xmn2500m" (we
> assume that the young generation does not exceed 10% of the max heap
> configuration) - this will help if the OOM comes from the driver, and (2)
> potentially increasing the memory overhead of the executors (--conf
> spark.yarn.executor.memoryOverhead=10240) if running on YARN and the node
> manager kills the executor processes because they exceed the container
> limits. If this does not help, please provide the -explain output and we
> will have a closer look.
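>
> On the spark-submit command line, these two settings would be added roughly
> like this (exact values depend on your cluster):
>
>   --driver-java-options "-Xmn2500m" \
>   --conf spark.yarn.executor.memoryOverhead=10240 \
>   ... (rest of the existing submit command unchanged)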
>
> Regards,
> Matthias
>
> On Thu, Jun 15, 2017 at 10:15 AM, Glenn Weidner <gweid...@us.ibm.com>
> wrote:
>
> > Hi Anthony,
> >
> > Could you retry your scenario without the '-exec spark' option? By
> > default, SystemML will run in hybrid_spark mode, which is more efficient.
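> >
> > For example, with the rest of the submit command unchanged:
> >
> >   $SPARK_HOME/bin/spark-submit [other options as before] \
> >     $SYSTEMML_HOME/SystemML.jar -f example.dml -explain runtime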
> >
> > Thanks,
> > Glenn
> >
> >
> >
> > From: Anthony Thomas <ahtho...@eng.ucsd.edu>
> > To: dev@systemml.apache.org
> > Date: 06/15/2017 09:50 AM
> > Subject: Unexpected Executor Crash
> > ------------------------------
> >
> >
> >
> > Hi SystemML Developers,
> >
> > I'm running the following simple DML script under SystemML 0.14:
> >
> > M = read('/scratch/M5.csv')
> > N = read('/scratch/M5.csv')
> > MN = M %*% N
> > if (1 == 1) {
> >    print(as.scalar(MN[1,1]))
> > }
> >
> > The matrix M is square and about 5GB on disk (stored in HDFS). I am
> > submitting the script to a 2-node Spark cluster where each physical
> > machine has 30GB of RAM. I am using the following command to submit the
> > job:
> >
> > $SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G
> > --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128
> > --verbose --conf
> > spark.serializer=org.apache.spark.serializer.KryoSerializer
> > $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain runtime
> >
> > However, I consistently run into errors like:
> >
> > ERROR TaskSchedulerImpl: Lost executor 1 on 172.31.3.116: Remote RPC
> > client
> > disassociated. Likely due to containers exceeding thresholds, or network
> > issues. Check driver logs for WARN messages.
> >
> > and the job eventually aborts. Consulting the output of the executors
> > shows they are crashing with OutOfMemory exceptions. Even if one executor
> > needed to store M, N, and MN in memory simultaneously, it seems like there
> > should be enough memory, so I'm unsure why the executor is crashing. In
> > addition, I was under the impression that Spark would spill to disk if
> > there was insufficient memory. I've tried various combinations of
> > increasing/decreasing the number of executor cores (from 1 to 8), using
> > more/fewer executors, increasing/decreasing Spark's memoryFraction, and
> > increasing/decreasing Spark's default parallelism, all without success.
> > Can anyone offer any advice or suggestions to debug this issue further?
> > I'm not a very experienced Spark user, so it's very possible I haven't
> > configured something correctly. Please let me know if you'd like any
> > further information.
> >
> > Best,
> >
> > Anthony Thomas
> >
> >
> >
> >
>
>
>
