Hello,

I'm running into an out-of-memory issue when I run the Kmeans.dml
algorithm on a 1M-row matrix of generated test data. I have been trying
to generate a heap dump to help diagnose the problem, but so far I
haven't been able to produce a heap dump file. I was wondering if anyone
has advice on the out-of-memory issue itself, or on how to capture a
heap dump for it.

I set up a 4-node Hadoop cluster (on Red Hat Enterprise Linux Server
release 6.6 (Santiago)) with HDFS and YARN to try out SystemML in Hadoop
batch mode. The master node has NameNode, SecondaryNameNode, and
ResourceManager daemons running on it. The 3 other nodes have DataNode and
NodeManager daemons running on them.
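
For reference, I confirmed which daemons were running on each node
using the JDK's jps tool:

$ jps   # master lists NameNode, SecondaryNameNode, ResourceManager
        # workers list DataNode and NodeManager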

I'm trying out the Kmeans.dml algorithm. To begin, I generated test data
using the genRandData4Kmeans.dml script with 100K rows via:

hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs
nr=100000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=Xsmall.mtx
C=Csmall.mtx Y=Ysmall.mtx YbyC=YbyCsmall.mtx

Next, I ran Kmeans.dml against the Xsmall.mtx 100K-row matrix via:

hadoop jar system-ml-0.8.0/SystemML.jar -f
system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=Xsmall.mtx k=5

This ran perfectly.

However, I then increased the amount of test data to 1M rows, which
produced a matrix of about 3 GB:

hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs
nr=1000000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=X.mtx C=C.mtx
Y=Y.mtx YbyC=YbyC.mtx
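
For reference, the ~3 GB size of the generated matrix can be confirmed
on HDFS with:

$ hadoop fs -du -s -h X.mtx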

I ran Kmeans.dml against the 1M-row X.mtx matrix via:

hadoop jar system-ml-0.8.0/SystemML.jar -f
system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=X.mtx k=5

In my console, I received a number of error messages such as:

Error: Java heap space
15/11/13 14:48:58 INFO mapreduce.Job: Task Id :
attempt_1447452404596_0006_m_000023_1, Status : FAILED
Error: GC overhead limit exceeded

Next, I attempted to set up heap dump generation on out-of-memory
errors. I also added settings so that I could monitor memory usage
remotely using JConsole.

I added the following lines to my hadoop-env.sh files on each node:

export HADOOP_NAMENODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false ${HADOOP_NAMENODE_OPTS}"

export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false ${HADOOP_DATANODE_OPTS}"

I added the following to my yarn-env.sh files on each node:

export YARN_RESOURCEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/
-Dcom.sun.management.jmxremote.port=9998
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false
${YARN_RESOURCEMANAGER_OPTS}"

export YARN_NODEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/
-Dcom.sun.management.jmxremote.port=9998
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false ${YARN_NODEMANAGER_OPTS}"

Additionally, I modified the bin/hadoop file:

HADOOP_OPTS="$HADOOP_OPTS -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps/
-Dcom.sun.management.jmxremote.port=9997
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false"
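
To confirm that the restarted Java processes actually picked up these
options, their JVM arguments can be listed with the JDK's jps tool,
e.g.:

$ jps -v | grep HeapDumpOnOutOfMemoryError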

Using JConsole, I was able to monitor my Java processes remotely in
real time, but I could not see where the out-of-memory error was
happening.
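
For reference, the remote JConsole connections were made against the
JMX ports configured above (the hostnames below are placeholders for my
actual node names):

$ jconsole master-node:9999   # NameNode JVM
$ jconsole worker-node:9998   # NodeManager JVM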

Next, I examined the error logs on the 4 nodes, searching for FATAL
entries with the following:

$ pwd
/home/hadoop2/hadoop-2.6.2/logs
$ grep -R FATAL *

On the slave nodes, I found log messages such as the following, which
seem to indicate that the error occurred in a MapReduce task JVM
(YarnChild) running inside a YARN container rather than in one of the
Hadoop daemons:

userlogs/application_1447377156841_0006/container_1447377156841_0006_01_000007/syslog:2015-11-12
17:53:22,581 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running
child : java.lang.OutOfMemoryError: GC overhead limit exceeded
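
If that's right, my hadoop-env.sh, yarn-env.sh, and bin/hadoop changes
would not apply to the failing JVM, since the MapReduce task containers
get their options from the job configuration instead. I'm guessing the
heap dump flags would need to go on the task JVMs, e.g. something like
the following in mapred-site.xml (untested; the memory sizes and the
heapdumps-tasks directory, which would need to exist locally on every
worker node, are just placeholders):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1536m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-tasks/</value>
</property>

(with the equivalent mapreduce.reduce.memory.mb and
mapreduce.reduce.java.opts settings for the reducers)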

Does anyone have advice on what is causing this error, or on how I can
generate a heap dump to help diagnose the issue?

Thank you,

Deron
