Thanks, Matthias and Glenn. I'll give these suggestions a try once I get back in the office tomorrow.
Best,
Anthony

On Jun 15, 2017 12:36 PM, "Matthias Boehm" <mboe...@googlemail.com> wrote:

well, I think Anthony already used -exec spark here; I would recommend
(1) fixing the driver configuration via --driver-java-options "-Xmn2500m"
(we assume that the young generation does not exceed 10% of the max heap
configuration) - this will help if the OOM comes from the driver, and
(2) potentially increasing the memory overhead of the executors (--conf
spark.yarn.executor.memoryOverhead=10240) if running on YARN and the node
manager kills the executor processes because they exceed the container
limits. If this does not help, please provide the -explain output and
we'll have a closer look.

Regards,
Matthias

On Thu, Jun 15, 2017 at 10:15 AM, Glenn Weidner <gweid...@us.ibm.com> wrote:
> Hi Anthony,
>
> Could you retry your scenario without the '-exec spark' option? By
> default, SystemML will run in hybrid_spark mode, which is more efficient.
>
> Thanks,
> Glenn
>
> From: Anthony Thomas <ahtho...@eng.ucsd.edu>
> To: dev@systemml.apache.org
> Date: 06/15/2017 09:50 AM
> Subject: Unexpected Executor Crash
>
> Hi SystemML Developers,
>
> I'm running the following simple DML script under SystemML 0.14:
>
> M = read('/scratch/M5.csv')
> N = read('/scratch/M5.csv')
> MN = M %*% N
> if (1 == 1) {
>   print(as.scalar(MN[1,1]))
> }
>
> The matrix M is square and about 5 GB on disk (stored in HDFS). I am
> submitting the script to a two-node Spark cluster where each physical
> machine has 30 GB of RAM. I am using the following command to submit
> the job:
>
> $SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G \
>   --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 \
>   --verbose \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   $SYSTEMML_HOME/SystemML.jar -f example.dml -exec spark -explain runtime
>
> However, I consistently run into errors like:
>
> ERROR TaskSchedulerImpl: Lost executor 1 on 172.31.3.116: Remote RPC
> client disassociated. Likely due to containers exceeding thresholds,
> or network issues. Check driver logs for WARN messages.
>
> and the job eventually aborts. The output of the executors shows they
> are crashing with OutOfMemory exceptions. Even if one executor needed
> to store M, N, and MN in memory simultaneously, it seems like there
> should be enough memory, so I'm unsure why the executor is crashing.
> In addition, I was under the impression that Spark would spill to disk
> if there was insufficient memory. I've tried various combinations of
> increasing/decreasing the number of executor cores (from 1 to 8), using
> more/fewer executors, increasing/decreasing Spark's memoryFraction, and
> increasing/decreasing Spark's default parallelism, all without success.
> Can anyone offer any advice or suggestions to debug this issue further?
> I'm not a very experienced Spark user, so it's very possible I haven't
> configured something correctly. Please let me know if you'd like any
> further information.
>
> Best,
>
> Anthony Thomas
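(For reference, a submission command that folds in both suggestions above -
dropping -exec spark so SystemML runs in its default hybrid_spark mode,
setting the driver's young-generation size, and raising the YARN executor
memory overhead - might look roughly like the following. This is only a
sketch using the exact values quoted in the thread; the right numbers
depend on the cluster and will likely need tuning.)

# Hybrid_spark mode (no -exec spark), larger driver young generation,
# and extra YARN container overhead, per the suggestions above:
$SPARK_HOME/bin/spark-submit --driver-memory=5G --executor-memory=25G \
  --driver-java-options "-Xmn2500m" \
  --conf spark.yarn.executor.memoryOverhead=10240 \
  --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --verbose \
  $SYSTEMML_HOME/SystemML.jar -f example.dml -explain runtime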