Hi all !! I know that all are very busy with giraph 1.2.0, but i hope that someone can help me :)
I'm running a **Giraph** application on **EMR**. I'm using a cluster of 1 master and 10 slaves, all **m3.2xlarge** machines. The application consist, basically, on a BFS through the spanish version of Wikipedia (i adapted the Wikipedia information for fitting on Giraph). I execute the applications in the following way: /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.algoritmos.masivos.BusquedaDeCaminosNavegacionalesWikiquotesMasivo /tmp/vertices.txt 4 -@- 1 ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 10 -yh 11500 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.isStaticGraph=true,giraph.numInputThreads=4,giraph.numOutputThreads=4 I can sucessfully run a application with 3 supersteps, but if i want to do 4 supersteps, the application fails, a container gets killed, and the rest die along too. Searching in the Giraph Application Manager, says this: 16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-147.sa-east-1.compute.internal:9103 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1471231949464_0001_01_000005 16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-145.sa-east-1.compute.internal:9103 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000009 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000011 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000004 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000010 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000006 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000007 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000008 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000005 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000002 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000012 16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000003 16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=1 16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000008, state=COMPLETE, exitStatus=143, diagnostics=Container [pid=4455,containerID=container_1471231949464_0001_01_000008] is running beyond physical memory limits. Current usage: 11.4 GB of 11.3 GB physical memory used; 12.6 GB of 56.3 GB virtual memory used. Killing container. Dump of the process-tree for container_1471231949464_0001_01_000008 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 4459 4455 4455 4455 (java) 13568 5567 13419675648 2982187 java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 |- 4455 2706 4455 4455 (bash) 0 0 115875840 807 /bin/bash -c java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 1>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stdout.log 2>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stderr.log Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143 16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: After completion of one conatiner. current status is: completedCount :1 containersToLaunch :11 successfulCount :0 failedCount :1 16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=7 16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 1 16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000012, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 1 16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000006, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Container exited with a non-zero exit code 1 16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000007, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) So, appears to be a problem with memory on container 8, but here are the last logs lines of container 8 (remember, it's the container that gets killed): 16/08/15 03:46:52 INFO graph.ComputeCallable: call: Computation took 23.90834 secs for 10 partitions on superstep 3. Flushing started 16/08/15 03:46:52 INFO worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 3 Memory (free/total/max) = 4516.47M / 10619.50M / 10619.50M 16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 1307 open requests, waiting for it to be <= 0, MBytes/sec received = 0.0029, MBytesReceived = 0.0678, ave received req MBytes = 0, secs waited = 23.332 MBytes/sec sent = 143.2912, MBytesSent = 3343.4141, ave sent req MBytes = 0.4999, secs waited = 23.332 16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: 548 requests for taskId=10, 504 requests for taskId=0, 251 requests for taskId=5, 1 requests for taskId=4, 1 requests for taskId=7, 1 requests for taskId=8, So, i i understant this correctly, the container have 4516.47M available before doing **flush**, and when doing it, consumes all available of those 4516.47M, and when wants more, gets killed by Giraph AM? I don't understand why needs so much memory doing flush, its basically save the results on disk for the next superstep right? so theoretically shouldn’t need memory at all. Any help will be greatly apreciated. Thanks! Jose