Hello, I'm running into an out-of-memory issue when I run the Kmeans.dml algorithm on a 1M-row matrix of generated test data. To help diagnose the problem, I've been trying to generate a heap dump, but so far I haven't been able to produce one. Does anyone have advice on the out-of-memory issue, or on how to create a heap dump to help diagnose it?
I set up a 4-node Hadoop cluster (on Red Hat Enterprise Linux Server release 6.6 (Santiago)) with HDFS and YARN to try out SystemML in Hadoop batch mode. The master node runs the NameNode, SecondaryNameNode, and ResourceManager daemons. The 3 other nodes run the DataNode and NodeManager daemons. I'm trying out the Kmeans.dml algorithm.

To begin, I generated test data using the genRandData4Kmeans.dml script with 100K rows via:

  hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs nr=100000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=Xsmall.mtx C=Csmall.mtx Y=Ysmall.mtx YbyC=YbyCsmall.mtx

Next, I ran Kmeans.dml against the 100K-row Xsmall.mtx matrix via:

  hadoop jar system-ml-0.8.0/SystemML.jar -f system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=Xsmall.mtx k=5

This ran perfectly. I then increased the amount of test data to 1M rows, which resulted in matrix data about 3GB in size:

  hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs nr=1000000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=X.mtx C=C.mtx Y=Y.mtx YbyC=YbyC.mtx

I ran Kmeans.dml against the 1M-row X.mtx matrix via:

  hadoop jar system-ml-0.8.0/SystemML.jar -f system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=X.mtx k=5

In my console, I received a number of error messages such as:

  Error: Java heap space
  15/11/13 14:48:58 INFO mapreduce.Job: Task Id : attempt_1447452404596_0006_m_000023_1, Status : FAILED
  Error: GC overhead limit exceeded

Next, I attempted to generate a heap dump. I also added some settings so that I could watch memory usage remotely using JConsole.
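As a rough sanity check on sizes (my own back-of-the-envelope arithmetic, not something from the SystemML docs): a dense nr x nf matrix of 8-byte doubles needs about nr * nf * 8 bytes in memory, so a single dense copy of X is well under 1 GiB even though the text .mtx file on disk is about 3GB:

```shell
# Back-of-the-envelope: dense in-memory size of X, assuming 8-byte doubles.
nr=1000000   # rows, from the genRandData4Kmeans.dml invocation (nr=1000000)
nf=100       # columns/features (nf=100)
bytes=$((nr * nf * 8))
echo "${bytes} bytes"   # 800000000 bytes, i.e. roughly 0.75 GiB per dense copy
```

If the task JVMs are running with a default heap of only a few hundred MB, it seems plausible that block copies plus parsing overhead could exhaust it, which would match the "Java heap space" / "GC overhead limit exceeded" failures above.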
I added the following lines to my hadoop-env.sh files on each node:

  export HADOOP_NAMENODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/ -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_NAMENODE_OPTS}"

  export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/ -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_DATANODE_OPTS}"

I added the following to my yarn-env.sh files on each node:

  export YARN_RESOURCEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/ -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${YARN_RESOURCEMANAGER_OPTS}"

  export YARN_NODEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/ -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${YARN_NODEMANAGER_OPTS}"

Additionally, I modified the bin/hadoop file:

  HADOOP_OPTS="$HADOOP_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps/ -Dcom.sun.management.jmxremote.port=9997 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false"

With this in place, I was able to watch my Java processes remotely in real time using JConsole, but I did not see where the out-of-memory error was happening. Next, I examined the error logs on the 4 nodes.
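One thing I notice while writing this up: the options above only cover the long-running daemons, not the JVMs that YARN spawns for the map and reduce tasks themselves, which may be why no dump appeared. A sketch of what I believe the equivalent task-side settings would look like in mapred-site.xml (property names assume Hadoop 2.x; the -Xmx values are illustrative placeholders I have not verified, and the dump directory would need to exist and be writable on every slave node):

```xml
<!-- mapred-site.xml: pass heap-dump flags to the MapReduce task JVMs.
     -Xmx values are illustrative placeholders, not tested settings. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-tasks/</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-tasks/</value>
</property>
```

My understanding is that the container sizes (mapreduce.map.memory.mb / mapreduce.reduce.memory.mb) would also need to be at least the -Xmx value plus some overhead, or YARN may kill the containers; corrections welcome.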
I searched for FATAL entries with the following:

  $ pwd
  /home/hadoop2/hadoop-2.6.2/logs
  $ grep -R FATAL *

On the slave nodes, I had log messages such as the following, which seem to indicate that the error occurred inside a YARN task container (org.apache.hadoop.mapred.YarnChild, i.e. a MapReduce child task) rather than in one of the daemons:

  userlogs/application_1447377156841_0006/container_1447377156841_0006_01_000007/syslog:2015-11-12 17:53:22,581 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

Does anyone have any advice regarding what is causing this error, or how I can go about generating a heap dump to help diagnose the issue?

Thank you,
Deron
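In case it helps anyone who replies, this is what I'm currently trying in order to get at the failing task's logs and heap (these are standard Hadoop/JDK tools rather than anything SystemML-specific; jmap only works if I catch a task JVM while it is still alive, and <pid> is a placeholder for the process id printed by jps):

```shell
# Pull together all container logs for the failed application in one place
# (requires yarn.log-aggregation-enable=true; otherwise read
# $HADOOP_HOME/logs/userlogs/... on each slave node, as above).
yarn logs -applicationId application_1447377156841_0006

# On a slave node, find a live map-task JVM and dump its heap manually.
jps -l | grep YarnChild   # prints "<pid> org.apache.hadoop.mapred.YarnChild"
jmap -dump:live,format=b,file=/home/hadoop2/heapdumps/task.hprof <pid>
```

The resulting .hprof file should be loadable in a heap analyzer such as Eclipse MAT or jvisualvm.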