Hi Anthony,

Could you retry your scenario without the '-exec spark' option?  By
default, SystemML runs in hybrid_spark mode, which is generally more
efficient: operations that fit in driver memory run there, and only the
rest are compiled to Spark.
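
For reference, here is a sketch of what that retry might look like. It assumes everything else in your submission stays exactly as you posted it; the only change is that '-exec spark' is removed. The snippet only prints the assembled command so it can be checked before anything is submitted.

```shell
# Same arguments as the original submission, with only '-exec spark' removed.
# Assumes SPARK_HOME and SYSTEMML_HOME are set as in the original command.
set -- "$SPARK_HOME/bin/spark-submit" \
  --driver-memory=5G --executor-memory=25G \
  --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 \
  --verbose \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  "$SYSTEMML_HOME/SystemML.jar" -f example.dml -explain runtime

# Print the assembled command for review; run "$@" instead to actually submit.
printf '%s\n' "$*"
```

With '-explain runtime' still in place, the printed plan should show which instructions end up running in the driver and which on Spark.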

Thanks,
Glenn




From:   Anthony Thomas <ahtho...@eng.ucsd.edu>
To:     dev@systemml.apache.org
Date:   06/15/2017 09:50 AM
Subject:        Unexpected Executor Crash



Hi SystemML Developers,

I'm running the following simple DML script under SystemML 0.14:

M = read('/scratch/M5.csv')
N = read('/scratch/M5.csv')
MN = M %*% N
if (1 == 1) {
    print(as.scalar(MN[1,1]))
}

The matrix M is square and about 5GB on disk (stored in HDFS). I am
submitting the script to a 2 node spark cluster where each physical machine
has 30GB of RAM. I am using the following command to submit the job:

$SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G \
  --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 \
  --verbose \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain runtime

However, I consistently run into errors like:

ERROR TaskSchedulerImpl: Lost executor 1 on 172.31.3.116: Remote RPC client
disassociated. Likely due to containers exceeding thresholds, or network
issues. Check driver logs for WARN messages.

and the job eventually aborts. The executor output shows that they are
crashing with OutOfMemory exceptions. Even if one executor needed to store
M, N, and MN in memory simultaneously, it seems like there should be enough
memory, so I'm unsure why the executor is crashing. In addition, I
was under the impression that Spark would spill to disk if there was
insufficient memory. I've tried various combinations of
increasing/decreasing the number of executor cores (from 1 to 8), using
more/fewer executors, increasing/decreasing Spark's memoryFraction, and
increasing/decreasing Spark's default parallelism all without success. Can
anyone offer any advice or suggestions to debug this issue further? I'm not
a very experienced Spark user so it's very possible I haven't configured
something correctly. Please let me know if you'd like any further
information.

Best,

Anthony Thomas

